├── .gitignore
├── README.md
├── alignment
    ├── README.md
    ├── img
    │   └── alignment.png
    ├── installation.md
    └── smithwaterman.md
├── bin
    └── update_index.py
├── core
    ├── README.md
    ├── img
    │   └── core.png
    ├── installation.md
    ├── readwrite.md
    ├── sequences.md
    └── translating.md
├── genomics
    ├── README.md
    ├── chromosomeposition.md
    ├── genebank.md
    ├── genenames.md
    ├── gff.md
    ├── img
    │   └── genomics.png
    ├── installation.md
    ├── karyotype.md
    └── twobit.md
├── installation.md
├── license.md
├── logo.png
├── modfinder
    ├── README.md
    ├── add-protein-modification.md
    ├── identify-protein-modifications.md
    ├── installation.md
    └── supported-protein-modifications.md
├── protein-disorder
    └── README.md
└── structure
    ├── README.md
    ├── alignment-data-model.md
    ├── alignment.md
    ├── asa.md
    ├── bioassembly.md
    ├── caching.md
    ├── chemcomp.md
    ├── contact-map.md
    ├── crystal-contacts.md
    ├── externaldb.md
    ├── firststeps.md
    ├── img
        ├── 143px-Selenomethionine-from-xtal-3D-balls.png
        ├── 1cfd_1cll_fatcat.png
        ├── 1cfd_1cll_fatcat.xcf
        ├── 1cfd_1cll_flexible.png
        ├── 1cfd_1cll_rigid.png
        ├── 1dan_scop.png
        ├── 1gav_asym.png
        ├── 1gav_biounit.png
        ├── 1hho_asym.png
        ├── 1hho_biounit.png
        ├── 1m4x_bio_r_250.jpg
        ├── 2hyn_1zll.png
        ├── 3cna.A_2pel.A_cecp.png
        ├── 4hhb_bio_r_250.jpg
        ├── 4hhb_jmol.png
        ├── alignment_gui.png
        ├── alignmentpanel.png
        ├── cath_1dan.png
        ├── database_search.png
        ├── database_search_results.png
        ├── multiple_gui.png
        ├── multiple_jmol_globins.png
        ├── multiple_panel_globins.png
        ├── symm_combined.png
        ├── symm_helical.png
        ├── symm_hierarchy.png
        ├── symm_internal.png
        ├── symm_local.png
        ├── symm_pg.png
        ├── symm_pseudo.png
        └── symm_subunits.png
    ├── installation.md
    ├── lists.md
    ├── mmcif.md
    ├── secstruc.md
    ├── seqres.md
    ├── special.md
    ├── structure-data-model.md
    └── symmetry.md


/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | .profile
3 | .settings
4 | .idea


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 |  <img src="logo.png" height="30"/> Tutorial
 2 | ===
 3 | 
 4 | A brief introduction to [BioJava](https://www.biojava.org).
 5 | -----
 6 | 
 7 | The goal of this tutorial is to provide an educational introduction to some of the features provided by BioJava. The tutorial is still under development and does not yet cover the entire library. Please also check other sources of [documentation](https://biojava.org/wiki/Documentation).
 8 | 
 9 | The examples within the tutorial are intended to work with the most recent version of BioJava. Please do submit a [new issue](https://github.com/biojava/biojava-tutorial/issues) if you find any problems.
10 | 
11 | The tutorial is subdivided into several books, corresponding to the respective BioJava modules. Each book is further subdivided into several chapters that describe the main functionality of the module in order of increasing complexity.
12 | 
13 | ## Index
14 | 
15 | [Quick Installation](installation.md)
16 | 
17 | Book 1: [The Core Module](core/README.md), basic working with sequences.
18 | 
19 | Book 2: [The Alignment Module](alignment/README.md), pairwise and multiple alignments of protein sequences.
20 | 
21 | Book 3: [The Structure Modules](structure/README.md), everything related to working with 3D structures.
22 | 
23 | Book 4: [The Genomics Module](genomics/README.md), working with genomic data.
24 | 
25 | Book 5: [The Protein-Disorder Module](protein-disorder/README.md), predicting protein disorder.
26 | 
27 | Book 6: [The ModFinder Module](modfinder/README.md), identifying protein modifications in 3D structures.
28 | 
29 | ## License
30 | 
31 | The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md).
32 | 
33 | ## Please Cite
34 | 
35 | **BioJava 5: A community driven open-source bioinformatics library**<br/>
36 | *Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte* <br/>
37 | [PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791) <br/>
38 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
39 | 
40 | 
41 | <!--automatically generated footer-->
42 | 


--------------------------------------------------------------------------------
/alignment/README.md:
--------------------------------------------------------------------------------
 1 | The BioJava - Alignment Module
 2 | =====================================================
 3 | 
 4 | A tutorial for the alignment module of [BioJava](http://www.biojava.org).
 5 | 
 6 | ## About
 7 | <table>
 8 |     <tr>
 9 |         <td>
10 |             <img src="img/alignment.png"/>
11 |         </td>
12 |         <td>
13 |             The <i>alignment</i> module of BioJava provides an API that contains
14 |             <ul>
15 |                 <li>Implementations of dynamic programming algorithms for sequence alignment</li>
16 |                 <li>Reading and Writing of popular alignment file formats</li>
17 |                 <li>A single- or multi-threaded multiple sequence alignment algorithm</li>
18 |             </ul>
19 |         </td>
20 |     </tr>
21 | </table>   
22 | 
23 | ## Index
24 | 
25 | This tutorial is split into several chapters.
26 | 
27 | Chapter 1 - Quick [Installation](installation.md)
28 | 
29 | Chapter 2 - Global alignment - Needleman and Wunsch algorithm
30 | 
31 | Chapter 3 - [Local alignment](smithwaterman.md) - Smith-Waterman algorithm
32 | 
33 | Chapter 4 - Multiple Sequence alignment
34 | 
35 | Chapter 5 - Reading and writing of multiple alignments
36 | 
37 | Chapter 6 - BLAST - why you don't need BioJava for parsing BLAST
38 | 
39 | ## License
40 | 
41 | The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
42 | 
43 | ## Please cite
44 | 
45 | **BioJava 5: A community driven open-source bioinformatics library**<br/>
46 | *Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte* <br/>
47 | [PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791) <br/>
48 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
49 | 
50 | 
51 | 
52 | <!--automatically generated footer-->
53 | 
54 | ---
55 | 
56 | Navigation:
57 | [Home](../README.md)
58 | | Book 2: The Alignment Module
59 | 
60 | Prev: [Book 1: The Core Module](../core/README.md)
61 | 
62 | Next: [Book 3: The Structure Modules](../structure/README.md)
63 | 


--------------------------------------------------------------------------------
/alignment/img/alignment.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/alignment/img/alignment.png


--------------------------------------------------------------------------------
/alignment/installation.md:
--------------------------------------------------------------------------------
 1 | ## Quick Installation
 2 | 
 3 | First, a quick paragraph on how to get access to BioJava.
 4 | 
 5 | BioJava is open source and you can get the code from [GitHub](https://github.com/biojava/biojava). However, it might be easier this way:
 6 | 
 7 | BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html) guide.
 8 | 
 9 | As of version 4, BioJava is available in Maven Central. This is all you need to add a BioJava dependency to your project:
10 | 
11 | ```xml
12 |         <dependencies>
13 |                 ...
14 | 
15 |                 <!-- This imports version 4.0.0 of the BioJava alignment module; check Maven Central for newer releases -->
16 |                 <dependency>
17 | 
18 |                         <groupId>org.biojava</groupId>
19 |                         <artifactId>biojava-alignment</artifactId>
20 |                         <version>4.0.0</version>
21 |                 </dependency>
22 | 
23 | 
24 |                 <!-- other biojava jars as needed -->
25 | 
26 |         </dependencies> 
27 | ```
28 | 
29 | If you run 
30 | 
31 | <pre>
32 |     mvn package
33 | </pre>
34 | 
35 |  on your project, the BioJava dependencies will be automatically downloaded and installed for you.
36 | 
37 | 
38 | <!--automatically generated footer-->
39 | 
40 | ---
41 | 
42 | Navigation:
43 | [Home](../README.md)
44 | | [Book 2: The Alignment Module](README.md)
45 | | Chapter 1 : Installation
46 | 


--------------------------------------------------------------------------------
/alignment/smithwaterman.md:
--------------------------------------------------------------------------------
 1 | Smith Waterman - Local Alignment
 2 | ================================
 3 | 
 4 | BioJava contains implementations of various protein sequence and 3D structure alignment algorithms. Here is how to run a local (Smith-Waterman) alignment of two protein sequences:
 5 | 
 6 | 
 7 | 
 8 | ```java
 9 | public static void main(String[] args) throws Exception {
10 | 
11 | 		String uniprotID1 = "P69905";
12 | 		String uniprotID2 = "P68871";
13 | 
14 | 		ProteinSequence s1 = getSequenceForId(uniprotID1);
15 | 		ProteinSequence s2 = getSequenceForId(uniprotID2);
16 | 
17 | 		SubstitutionMatrix<AminoAcidCompound> matrix = SubstitutionMatrixHelper.getBlosum65();
18 | 
19 | 		GapPenalty penalty = new SimpleGapPenalty();
20 | 
21 | 		int gop = 8;
22 | 		int extend = 1;
23 | 		penalty.setOpenPenalty(gop);
24 | 		penalty.setExtensionPenalty(extend);
25 | 
26 | 
27 | 		PairwiseSequenceAligner<ProteinSequence, AminoAcidCompound> smithWaterman =
28 | 				Alignments.getPairwiseAligner(s1, s2, PairwiseSequenceAlignerType.LOCAL, penalty, matrix);
29 | 
30 | 		SequencePair<ProteinSequence, AminoAcidCompound> pair = smithWaterman.getPair();
31 | 
32 | 
33 | 		System.out.println(pair.toString(60));
34 | 
35 | 
36 | 	}
37 | 
38 | 	private static ProteinSequence getSequenceForId(String uniProtId) throws Exception {
39 | 		URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId));
40 | 		ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId);
41 | 		System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader());
42 | 		System.out.println();
43 | 
44 | 		return seq;
45 | 	}
46 | ```
47 | 
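
To make the role of the gap open and extension penalties more concrete, the scoring recurrence behind a local alignment can be sketched in plain Java. This is an illustration only and not BioJava's implementation: the class `LocalAlignmentSketch`, the fixed match/mismatch scores, and the convention that a gap of length k costs `open + k * extend` are assumptions of this sketch, and it returns only the best score rather than the aligned pair.

```java
// Plain-Java sketch of Smith-Waterman local alignment scoring with affine gaps.
// Hypothetical helper for illustration; BioJava uses a substitution matrix
// (e.g. BLOSUM) instead of the fixed scores below.
class LocalAlignmentSketch {

    static final int MATCH = 2;      // assumed score for identical residues
    static final int MISMATCH = -1;  // assumed score for different residues

    /** Best local alignment score; a gap of length k costs open + k * extend. */
    static int score(String a, String b, int open, int extend) {
        int n = a.length();
        int m = b.length();
        int negInf = Integer.MIN_VALUE / 2; // "minus infinity" that cannot overflow

        int[][] h = new int[n + 1][m + 1]; // best alignment ending at (i, j)
        int[][] e = new int[n + 1][m + 1]; // ... ending with a gap in a
        int[][] f = new int[n + 1][m + 1]; // ... ending with a gap in b
        for (int i = 0; i <= n; i++) { e[i][0] = negInf; f[i][0] = negInf; }
        for (int j = 0; j <= m; j++) { e[0][j] = negInf; f[0][j] = negInf; }

        int best = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                // Either extend an existing gap or open a new one.
                e[i][j] = Math.max(e[i][j - 1] - extend, h[i][j - 1] - open - extend);
                f[i][j] = Math.max(f[i - 1][j] - extend, h[i - 1][j] - open - extend);
                int s = a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH;
                int diag = h[i - 1][j - 1] + s;
                // The floor of 0 is what makes the alignment local.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(e[i][j], f[i][j])));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }
}
```

With these assumed scores, aligning `"AAAAACCCCC"` against `"AAAAAGCCCCC"` with penalties 8 and 1 favours an ungapped alignment containing one mismatch, while lowering both penalties to 1 makes opening a gap worthwhile.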


--------------------------------------------------------------------------------
/bin/update_index.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | """
  3 | This script generates the footers for all markdown files. Rerun the script
  4 | after adding new books or chapters in order to update the footer sections on
  5 | each page with links to the next and previous chapters.
  6 | 
  7 | The script works by recursively parsing "## Index" sections in files, starting
  8 | with README.md. The footer is marked with an HTML comment, `automatically
  9 | generated footer`.  Any text after this comment is destroyed by the script, so
 10 | all edits should be made above that point.
 11 | 
 12 | """
 13 | 
 14 | import sys,os,re
 15 | 
 16 | class TutorialIndex(object):
 17 | 
 18 |     footermark = u"<!--automatically generated footer-->"
 19 | 
 20 |     def __init__(self,link,chapter=None,title=None,parent=None):
 21 |         """Create a new TutorialIndex
 22 | 
 23 |         :param link:    A link to this page, relative to parent's link
 24 |         :param chapter: The chapter number, e.g. "Chapter 5"
 25 |         :param title:   The chapter title, e.g. "Writing Docstrings"
 26 |         :param parent:  The TutorialIndex which references this one
 27 |         """
 28 |         self.link = link
 29 |         self.chapter = chapter
 30 |         self.title = title
 31 |         self.parent = parent
 32 |         self.children = []
 33 | 
 34 |     def parse(self):
 35 |         """Parse the index, add a footer, and do the same for each child
 36 |         found in the '# Index' section (if any).
 37 | 
 38 |         Caution, this method overwrites any existing footer sections on this
 39 |         file and all children!
 40 |         """
 41 | 
 42 |         #Recognise and parse "<CHAPTER>: [<TITLE>](<LINK>)"
 43 |         indexentry = re.compile(r"^(.*)[:-].*\[([^]]*)\]\(([^)]*)\).*$")
 44 | 
 45 |         filename = self.rootlink()
 46 | 
 47 |         with open(filename,"r+") as file:
 48 |             line = file.readline()
 49 |             had_footer=False
 50 | 
 51 |             # Parse file for index, truncate prior footer, and append footermark
 52 |             in_index = False
 53 |             while line:
 54 |                 if line[0] == u"#": #That's a header, not a comment
 55 |                     if u"index" in line.lower():
 56 |                         in_index = True
 57 |                     else:
 58 |                         in_index = False
 59 |                 elif line.strip() == TutorialIndex.footermark: # Footer already!
 60 |                     had_footer=True
 61 |                     file.truncate()
 62 |                     break
 63 |                 elif in_index:
 64 |                     # look for 'Chapter 1: [Title](link)'
 65 |                     result = indexentry.match(line)
 66 |                     if result:
 67 |                         chapter,title,link = result.groups()
 68 |                         child = TutorialIndex(link,chapter,title,self)
 69 |                         self.children.append(child)
 70 | 
 71 |                 line = file.readline()
 72 | 
 73 |             # Append footer
 74 |             if not had_footer:
 75 |                 file.write(u"\n")
 76 |                 file.write(TutorialIndex.footermark)
 77 |                 file.write(u"\n")
 78 |             footer = self.makefooter()
 79 |             file.write(footer)
 80 | 
 81 |         # Recurse to children
 82 |         for child in self.children:
 83 |             child.parse()
 84 | 
 85 |     def rootlink(self):
 86 |         """Convert self.link to an absolute path relative to the root TutorialIndex
 87 |         :return: The path to this TutorialIndex relative to the root index
 88 |         """
 89 |         if self.parent is None:
 90 |             return self.link
 91 |         parentlink = self.parent.rootlink()
 92 | 
 93 |         return os.path.join(os.path.dirname(parentlink),self.link)
 94 | 
 95 |     def makefooter(self):
 96 |         """ makefooter() -> str
 97 | 
 98 |         Creates the footer text (everything below the "automatically generated
 99 |         footer" line)
100 |         """
101 |         # Don't include footer on main page
102 |         if self.parent is None:
103 |             return ""
104 | 
105 |         lines = ["","---","","Navigation:"]
106 |         # Iterate over parents
107 |         p = self.parent
108 |         linkmd = [self.makename()] #reverse order (self to root)
109 |         while p is not None:
110 |             name = p.makename()
111 |             # Get a path to p relative to our own path
112 |             link = os.path.relpath(p.rootlink(),os.path.dirname(self.rootlink()))
113 |             linkmd.append("[{0}]({1})".format(name,link))
114 |             p = p.parent
115 |         linkmd.reverse()
116 |         lines.append("\n| ".join(linkmd))
117 | 
118 |         lines.append("")
119 | 
120 |         if self.parent is not None:
121 |             pos = self.parent.children.index(self) #Should always work
122 |             if pos > 0:
123 |                 prev = self.parent.children[pos-1]
124 |                 name = prev.makename()
125 |                 link = os.path.relpath(prev.rootlink(),os.path.dirname(self.rootlink()))
126 |                 lines.append("Prev: [{0}]({1})".format(name,link))
127 |                 lines.append("")
128 |             if pos < len(self.parent.children)-1:
129 |                 next = self.parent.children[pos+1]
130 |                 name = next.makename()
131 |                 link = os.path.relpath(next.rootlink(),os.path.dirname(self.rootlink()))
132 |                 lines.append("Next: [{0}]({1})".format(name,link))
133 |                 lines.append("")
134 | 
135 |         #lines.append(self.makename()+", "+self.link)
136 |         return "\n".join(lines)
137 | 
138 |     def makename(self):
139 |         """ Return a name, like "<CHAPTER>: <TITLE>"
140 |         """
141 |         if self.chapter:
142 |             name = self.chapter
143 |             if self.title:
144 |                 name += ": " + self.title
145 |         elif self.title:
146 |             name = self.title
147 |         else:
148 |             name = self.link #last resort
149 | 
150 |         return name
151 | 
152 |     def __repr__(self):
153 |         return "TutorialIndex({self.link!r},{self.chapter!r},{self.title!r},{parent!r})" \
154 |                 .format(self=self,parent=self.parent.title if self.parent else None)
155 | 
156 | if __name__ == "__main__":
157 |     # Set root index
158 |     root = TutorialIndex("README.md",title="Home")
159 | 
160 |     # Rewrite headers
161 |     root.parse()
162 | 
163 |     # Output tree
164 |     def pr(node,indent=""):
165 |         print("{0}{1}".format(indent,node.link))
166 |         for n in node.children:
167 |             pr(n,indent+"  ")
168 | 
169 |     pr(root)
170 | 


--------------------------------------------------------------------------------
/core/README.md:
--------------------------------------------------------------------------------
 1 | The BioJava - Core Module
 2 | =====================================================
 3 | 
 4 | A tutorial for the core module of [BioJava](http://www.biojava.org).
 5 | 
 6 | ## About
 7 | <table>
 8 |     <tr>
 9 |         <td>
10 |             <img src="img/core.png"/>
11 |         </td>
12 |         <td>
13 |             The <i>core</i> module of BioJava provides an API for
14 |             <ul>
15 |                 <li>Basic operations with biological sequences</li>
16 |                 <li>Reading and Writing of popular sequence file formats</li>
17 |                 <li>Translation of DNA sequences into protein sequences</li>
18 |             </ul>
19 |         </td>
20 |     </tr>
21 | </table>   
22 | 
23 | ## Index
24 | 
25 | This tutorial is split into several chapters.
26 | 
27 | Chapter 1 - Quick [Installation](installation.md)
28 | 
29 | Chapter 2 - [Basic Sequence types](sequences.md)
30 | 
31 | Chapter 3 - [Reading and Writing sequences](readwrite.md)
32 | 
33 | Chapter 4 - [Translating](translating.md) DNA and protein sequences.
34 | 
35 | ## License
36 | 
37 | The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
38 | 
39 | ## Please Cite
40 | 
41 | **BioJava 5: A community driven open-source bioinformatics library**<br/>
42 | *Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte* <br/>
43 | [PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791) <br/>
44 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
45 | 
46 | 
47 | 
48 | <!--automatically generated footer-->
49 | 
50 | ---
51 | 
52 | Navigation:
53 | [Home](../README.md)
54 | | Book 1: The Core Module
55 | 
56 | Next: [Book 2: The Alignment Module](../alignment/README.md)
57 | 


--------------------------------------------------------------------------------
/core/img/core.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/core/img/core.png


--------------------------------------------------------------------------------
/core/installation.md:
--------------------------------------------------------------------------------
 1 | ## Quick Installation
 2 | 
 3 | First, a quick paragraph on how to get access to BioJava.
 4 | 
 5 | BioJava is open source and you can get the code from [GitHub](https://github.com/biojava/biojava). However, it might be easier this way:
 6 | 
 7 | BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html) guide.
 8 | 
 9 | 
10 | As of version 4, BioJava is available in Maven Central. This is all you need to add a BioJava dependency to your project:
11 | 
12 | ```xml
13 |         <dependencies>
14 |                 ...
15 | 
16 |                 <!-- This imports version 4.0.0 of the BioJava core module; check Maven Central for newer releases -->
17 |                 <dependency>
18 | 
19 |                         <groupId>org.biojava</groupId>
20 |                         <artifactId>biojava-core</artifactId>
21 |                         <version>4.0.0</version>
22 |                 </dependency>
23 | 
24 | 
25 |                 <!-- other biojava jars as needed -->
26 | 
27 |         </dependencies> 
28 | ```
29 | 
30 | If you run 
31 | 
32 | <pre>
33 |     mvn package
34 | </pre>
35 | 
36 |  on your project, the BioJava dependencies will be automatically downloaded and installed for you.
37 | 
38 | 
39 | <!--automatically generated footer-->
40 | 
41 | ---
42 | 
43 | Navigation:
44 | [Home](../README.md)
45 | | [Book 1: The Core Module](README.md)
46 | | Chapter 1 : Installation
47 | 
48 | Next: [Chapter 2 : Basic Sequence types](sequences.md)
49 | 


--------------------------------------------------------------------------------
/core/readwrite.md:
--------------------------------------------------------------------------------
  1 | Reading and Writing of Basic sequence file formats
  2 | ==================================================
  3 | 
  4 | 
  5 | TODO: needs more examples
  6 | 
  7 | 
  8 | ## FASTA
  9 | 
 10 | A quick way of parsing a FASTA file is using the FastaReaderHelper class. 
 11 | 
 12 | Here is an example that parses a UniProt FASTA file into a protein sequence.
 13 | 
 14 | ```java
 15 | public static ProteinSequence getSequenceForId(String uniProtId) throws Exception {
 16 | 		URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId));
 17 | 		ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId);
 18 | 		System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader());
 19 | 		System.out.println();
 20 | 
 21 | 		return seq;
 22 | 	}
 23 | ```
 24 | 
 25 | 
 26 | BioJava can also be used to parse large FASTA files. The example below can parse a 1 GB (compressed) version of TrEMBL with standard memory settings.
 27 | 
 28 | 
 29 | ```java
 30 |     
 31 |     
 32 |     
 33 |      /** Download a large file, e.g. ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz
 34 |      * and pass in path to local location of file
 35 |      *
 36 |      * @param args
 37 |      */
 38 |         public static void main(String[] args) {
 39 | 
 40 |             if ( args.length < 1) {
 41 |                 System.err.println("First argument needs to be path to fasta file");
 42 |                 return;
 43 |             }
 44 | 
 45 |             File f = new File(args[0]);
 46 | 
 47 |             if ( ! f.exists()) {
 48 |                 System.err.println("File does not exist " + args[0]);
 49 |                 return;
 50 |             }
 51 | 
 52 |             try {
 53 | 
 54 |                 // automatically uncompresses files using InputStreamProvider
 55 |                 InputStreamProvider isp = new InputStreamProvider();
 56 |                 
 57 |                 InputStream inStream = isp.getInputStream(f);
 58 |                 
 59 |                 FastaReader<ProteinSequence, AminoAcidCompound> fastaReader = new FastaReader<ProteinSequence, AminoAcidCompound>(
 60 |                         inStream,
 61 |                         new GenericFastaHeaderParser<ProteinSequence, AminoAcidCompound>(),
 62 |                         new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet()));
 63 |                 
 64 |                 LinkedHashMap<String, ProteinSequence> b;
 65 | 
 66 | 
 67 |                 int nrSeq = 0;
 68 |                 
 69 |                 while ((b = fastaReader.process(10)) != null) {
 70 |                     for (String key : b.keySet()) {
 71 |                         nrSeq++;
 72 |                         System.out.println(nrSeq + " : " + key + " " + b.get(key));
 73 |                     }
 74 | 
 75 |                 }
 76 |             } catch (Exception ex) {
 77 |                 Logger.getLogger(ParseFastaFileDemo.class.getName()).log(Level.SEVERE, null, ex);
 78 |             }
 79 |         }
 80 | ```
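
The loop above stays within standard memory settings because only one small batch of sequences is held at a time. The idea can be sketched in plain Java as follows. This is not how `FastaReader` is implemented: the class `BatchedFastaSketch` and its `nextBatch` method are hypothetical names, headers are kept as plain strings, and, mirroring the loop above, `null` signals the end of the input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical batch-wise FASTA reader, for illustration only.
class BatchedFastaSketch {

    private final BufferedReader in;
    private String pendingHeader; // header already consumed, belongs to the next batch

    BatchedFastaSketch(Reader reader) {
        this.in = new BufferedReader(reader);
    }

    /** Reads up to max records (header -> sequence); returns null at end of input. */
    Map<String, String> nextBatch(int max) {
        Map<String, String> batch = new LinkedHashMap<>();
        String header = pendingHeader;
        pendingHeader = null;
        StringBuilder seq = new StringBuilder();
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (header != null) {
                        // A new header closes the previous record.
                        batch.put(header, seq.toString());
                        seq.setLength(0);
                        if (batch.size() == max) {
                            // Remember this header for the next call and stop reading.
                            pendingHeader = line.substring(1).trim();
                            return batch;
                        }
                    }
                    header = line.substring(1).trim();
                } else if (header != null) {
                    seq.append(line.trim());
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        if (header != null) {
            batch.put(header, seq.toString()); // last record in the input
        }
        return batch.isEmpty() ? null : batch;
    }
}
```

A caller would loop with `while ((batch = reader.nextBatch(10)) != null)` and let each batch become garbage once it has been processed.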
 81 | 
 82 | BioJava can also process large FASTA files using the Java streams API.
 83 | 
 84 | ```java
 85 |     FastaStreamer
 86 |         .from(path)
 87 |         .stream()
 88 |         .forEach(sequence -> System.out.printf("%s -> %s%n", sequence.getOriginalHeader(), sequence.getSequenceAsString()));
 89 | ```
 90 | 
 91 | If you need to specify a header parser other than `GenericFastaHeaderParser` or a sequence creator other than a
 92 | `ProteinSequenceCreator`, these can be specified before streaming the contents as follows:
 93 | 
 94 | ```java
 95 |     FastaStreamer
 96 |        .from(path)
 97 |        .withHeaderParser(new PlainFastaHeaderParser<>())
 98 |        .withSequenceCreator(new CasePreservingProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet()))
 99 |        .stream()
100 |        .forEach(sequence -> System.out.printf("%s -> %s%n", sequence.getOriginalHeader(), sequence.getSequenceAsString()));
101 | ```
102 | 
103 | 
104 | 
105 | <!--automatically generated footer-->
106 | 
107 | ---
108 | 
109 | Navigation:
110 | [Home](../README.md)
111 | | [Book 1: The Core Module](README.md)
112 | | Chapter 3 : Reading and Writing sequences
113 | 
114 | Prev: [Chapter 2 : Basic Sequence types](sequences.md)
115 | 
116 | Next: [Chapter 4 : Translating](translating.md)
117 | 


--------------------------------------------------------------------------------
/core/sequences.md:
--------------------------------------------------------------------------------
 1 | Sequences in BioJava
 2 | =====================
 3 | 
 4 | BioJava supports a number of basic biological sequence types: DNA, RNA, and protein sequences.
 5 | 
 6 | ## Create a basic sequence object
 7 | 
 8 | Create a DNA sequence
 9 | 
10 | ```java    
11 |     DNASequence seq = new DNASequence("GTAC"); 
12 | ```   
13 | 
14 | In addition to the basic DNA sequence class, there are specialized classes that extend DNASequence: 
15 | ChromosomeSequence, GeneSequence, IntronSequence, ExonSequence, and TranscriptSequence.
16 | 
17 | Create an RNA sequence
18 | 
19 | ```java    
20 |     RNASequence seq = new RNASequence("GUAC"); 
21 | ```   
22 | 
23 | Create a protein sequence
24 | 
25 | ```java    
26 |     ProteinSequence seq = new ProteinSequence("MSTNPKPQRKTKRNTNRRPQDVKFPGG"); 
27 | ```   
28 | 
29 | ## Ambiguity codes
30 | 
31 | When dealing with nucleotide sequences in particular, the exact nucleotides are sometimes not known. 
32 | BioJava supports the standard conventions for dealing with such ambiguity. 
33 | For example, "W" is commonly used to represent "A or T".
34 | By default, the set of compounds accepted in a sequence is strict, but it takes only one line of code to switch to supporting
35 | ambiguity codes.
36 | 
37 | 
38 | ```java            
39 |         // this throws an exception, because "W" is not part of the default (strict) DNA compound set
40 |         DNASequence dna1 = new DNASequence("WWW");
41 | 
42 |         // however, this works:
43 |         AmbiguityDNACompoundSet ambiguityDNACompoundSet = AmbiguityDNACompoundSet.getDNACompoundSet();
44 |         DNASequence dna2 = new DNASequence("WWW",ambiguityDNACompoundSet);
45 | ```   
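
The difference between a strict and an ambiguity-aware compound set can be illustrated without BioJava. The sketch below hard-codes the IUPAC nucleotide ambiguity codes; the class `IupacSketch` and its methods are hypothetical names, and BioJava's own compound sets may accept further symbols that this sketch does not model.

```java
// Plain-Java illustration of IUPAC nucleotide ambiguity codes.
// Hypothetical helper; not part of BioJava.
class IupacSketch {

    /** Returns the bases an IUPAC code can stand for, or "" for an unknown symbol. */
    static String expand(char code) {
        switch (Character.toUpperCase(code)) {
            case 'A': return "A";
            case 'C': return "C";
            case 'G': return "G";
            case 'T': return "T";
            case 'R': return "AG";   // purine
            case 'Y': return "CT";   // pyrimidine
            case 'S': return "CG";   // strong
            case 'W': return "AT";   // weak
            case 'K': return "GT";   // keto
            case 'M': return "AC";   // amino
            case 'B': return "CGT";  // not A
            case 'D': return "AGT";  // not C
            case 'H': return "ACT";  // not G
            case 'V': return "ACG";  // not T
            case 'N': return "ACGT"; // any base
            default:  return "";
        }
    }

    /** True if every symbol is exactly one of A, C, G, T (the "strict" case). */
    static boolean isStrict(String seq) {
        for (char c : seq.toCharArray()) {
            if (expand(c).length() != 1) {
                return false;
            }
        }
        return true;
    }

    /** True if every symbol is a known IUPAC code, ambiguous or not. */
    static boolean isValidWithAmbiguity(String seq) {
        for (char c : seq.toCharArray()) {
            if (expand(c).isEmpty()) {
                return false;
            }
        }
        return true;
    }
}
```

Here `"WWW"` fails the strict check but passes the ambiguity-aware one, which mirrors the behaviour of the two `DNASequence` constructors shown above.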
46 | 
47 | 
48 | ## Protein sequences and ambiguity
49 | The default AminoAcidCompoundSet already supports "Asparagine or Aspartic acid" and related ambiguities. 
50 | It also supports Selenocysteine and Pyrrolysine.
51 | 
52 | 
53 | 
54 | ## More details 
55 | 
56 | See the Cookbook for [more details on dealing with sequences](http://biojava.org/wiki/BioJava:CookBook:Core:Overview).
57 | <!--automatically generated footer-->
58 | 
59 | ---
60 | 
61 | Navigation:
62 | [Home](../README.md)
63 | | [Book 1: The Core Module](README.md)
64 | | Chapter 2 : Basic Sequence types
65 | 
66 | Prev: [Chapter 1 : Installation](installation.md)
67 | 
68 | Next: [Chapter 3 : Reading and Writing sequences](readwrite.md)
69 | 


--------------------------------------------------------------------------------
/core/translating.md:
--------------------------------------------------------------------------------
  1 | Translating RNA and protein sequences
  2 | =====================================
  3 | 
  4 | 
  5 | An example of how to parse a sequence from a String and use the translation engine to convert it into an amino acid sequence. 
  6 | 
  7 | ```java    
  8 |   String dnaFastaS = ">gb:GQ903697|Organism:Arenavirus H0030026 H0030026|Segment:S|Host:Rat\n" +
  9 |                 "CGCACAGAGGATCCTAGGCGTTACTGACTTGCGCTAATAACAGATACTGTTTCATATTTAGATAAAGACC\n" +
 10 |                 "CAGCCAACTGATTGGTCAGCATGGGACAACTTGTGTCCCTCTTCAGTGAAATTCCATCAATCATACACGA\n" +
 11 |                 "AGCTCTCAATGTTGCTCTCGTAGCTGTTAGCATCATTGCAATATTGAAAGGGGTTGTGAATGTTTGGAAG\n" +
 12 |                 "AGTGGAGTTTTGCAGCTTTTGGCCTTCTTGCTCCTGGCGGGAAGATCCTGCTCAGTCATAATTGGTCATC\n" +
 13 |                 "ATCTCGAACTGCAGCATGTGATCTTCAATGGGTCATCAATCACACCCTTTTTACCAGTTACATGTAAGAT\n" +
 14 |                 "CAATGATACCTACTTCCTACTAAGAGGCCCCTATGAAGCTGATTGGGCAGTTGAATTGAGTGTAACTGAA\n" +
 15 |                 "ACCACAGTCTTGGTTGATCTTGAAGGTGGCAGCTCAATGAAGCTGAAAGCCGGAAACATCTCAGGTTGTC\n" +
 16 |                 "TTGGAGACAACCCCCATCTGAGATCAGTGGTCTTCACATTGAATTGGTTGCTAACAGGATTAGATCATGT\n" +
 17 |                 "TATTGATTCTGACCCGAAAATTCTCTGTGATCTTAAAGACAGTGGGCACTTTCGTCTCCAGATGAACTTA\n" +
 18 |                 "ACAGAAAAGCACTATTGTGACAAGTTTCACATCAAAATGGGCAAGGTCTTTGGCGTATTCAAAGATCCGT\n" +
 19 |                 "GCATGGCTGGTGGTAAAATGTTTGCCATACTAAAAAATACCTCTTGGTCGAACCAGTGCCAAGGAAACCA\n" +
 20 |                 "TGTCAGCACCATTCATCTTGTCCTTCAGAGTAATTTCAAACAGGTCCTCAGTAGCAGGAAACTGTTGAAC\n" +
 21 |                 "TTTTTCAGCTGGTCATTGTCTGATGCCACAGGGGCTGATATGCCTGGTGGTTTTTGTCTGGAAAAATGGA\n" +
 22 |                 "TGTTGATTTCAAGTGAACTGAAATGCTTTGGAAACACAGCTGTGGCAAAGTGCAACTTAAATCATGACTC\n" +
 23 |                 "AGAGTTCTGTGACATGCTTAGGCTTTTTGATTTCAACAAAAAGGCAATAGTCACTCTTCAGAACAAAACA\n" +
 24 |                 "AAGCATCGGCTGGACACAGTAATTACTGCTATCAATTCATTGATCTCTGATAATATTCTTATGAAGAACA\n" +
 25 |                 "GGATTAAAGAATTGATAGATGTTCCTTACTGTAATTACACCAAATTTTGGTATGTCAATCACACAGGTCT\n" +
 26 |                 "AAATCTGCACACCCTTCCAAGATGTTGGCTTGTTAAAAATGGTAGCTACTTGAATGTGTCTGACTTCAGG\n" +
 27 |                 "AATGAGTGGATATTGGAGAGTGATCATCTTGTTTCGGAGATCCTTTCAAAGGAGTATGAGGAAAGGCAAA\n" +
 28 |                 "ATCGTACACCACTCTCACTGGTTGACATCTGTTTCTGGAGTACATTGTTTTACACAGCATCAATTTTCCT\n" +
 29 |                 "ACACCTCTTGAGAATTCCAACCCACAGACACATTGTTGGTGAGGGCTGCCCGAAGCCTCATAGGCTAAAC\n" +
 30 |                 "AGGCACTCAATATGTGCTTGTGGCCTTTTCAAACAAGAAGGCAGACCCTTGAGATGGGTAAGAAAGGTGT\n" +
 31 |                 "GAACAATGGTTGCTTGGTGGCCTCCATTGCTGCACCCCCCTAGGGGGGTGCAGCAATGGAGGTTCTCGYT\n" +
 32 |                 "GAGCCTAGAGAACAACTGTTGAATCGGGTTCTCTAAAGAGAACATCGATTGGTAGTACCCTTTTTGGTTT\n" +
 33 |                 "TTCATTGGTCACTGACCCTGAAAGCACAGCACTGAACATCAAACAGTCCAAAAGTGCACAGTGTGCATTT\n" +
 34 |                 "GTTGTGGCTGGTGCTGATCCTTTCTTCTTACTTTTAATGACTATTCCCTTATGTCTGTCACACAGATGTT\n" +
 35 |                 "CAAATCTCTTCCAAACAAGATCTTCAAAGAGCCGTGACTGTTCTGCGGTCAGTTTGACATCAACAATCTT\n" +
 36 |                 "CAAATCCTGTCTTCCATGCATATCAAAGAGCCTCCTAATATCATCAGCACCTTGCGCAGTGAAAACCATG\n" +
 37 |                 "GATTTAGGCAGACTCCTTATTATGCTTGTGATGAGGCCAGGTCGTGCATGTTCAACATCCTTCAGCAATA\n" +
 38 |                 "TCCCATGACAATATTTACTTTGGTCCTTAAAAGATTTTATGTCATTGGGTTTTCTGTAGCAGTGGATGAA\n" +
 39 |                 "TTTTTGTGATTCAGGCTGGTAAATTGCAAACTCAACAGGGTCATGTGGCGGGCCTTCAATGTCAATCCAT\n" +
 40 |                 "GTTGTGTCACTGACCATCAACGACTCTACACTTCTCTTCACCTGAGCCTCCACCTCAGGCTTGAGCGTGG\n" +
 41 |                 "ACAAGAGTGGGGCACCACCGTTCCGGATGGGGACTGGTGTTTTGCTTGGTAAACTCTCAAATTCCACAAC\n" +
 42 |                 "TGTATTGTCCCATGCTCTCCCTTTGATCTGTGATCTTGATGAAATGTAAGGCCAGCCCTCACCAGAGAGA\n" +
 43 |                 "CACACCTTATAAAGTATGTTTTCATAAGGATTCCTCTGTCCTGGTATGGCACTGATGAACATGTTTTCCC\n" +
 44 |                 "TCTTTTTGATCTCCAAGAGGGTTTTTATAATGGTTGTGAATGTGGACTCCTCAATCTTTATTGTTTCCAG\n" +
 45 |                 "CATGTTGCCACCATCAATCAGGCAAGCACCGGCTTTCACAGCAGCTGATAAACTAAGGTTGTAGCCTGAT\n" +
 46 |                 "ATGTTAATTTGAGAATCCTCCTGAGTGATTACCTTTAGAGAAGGATGCTTCTCCATCAAAGCATCTAAGT\n" +
 47 |                 "CACTTAAATTAGGGTATTTTGCTGTGTATAGCAACCCCAGATCTGTGAGGGCCTGAACCACATCATTTAG\n" +
 48 |                 "AGTTTCCCCTCCCTGTTCAGTCATACAGGAAATTGTGAGTGCTGGCATCGATCCAAATTGGTTGATCATA\n" +
 49 |                 "AGTGATGAGTCTTTAACGTCCCAGACTTTGACCACCCCTCCAGTTCTAGCCAACCCAGGTCTCTGAATAC\n" +
 50 |                 "CAACAAGTTGCAGAATTTCGGACCTCCTGGTGAGCTGTGTTGTAGAGAGGTTCCCTAGATACTGGCCACC\n" +
 51 |                 "TGTGGCTGTCAACCTCTCTGTTCTTTGAACTTTTTGCCTTAATTTGTCCAAGTCACTGGAGAGTTCCATT\n" +
 52 |                 "AGCTCTTCCTTTGACAATGATCCTATCTTAAGGAACATGTTCTTTTGGGTTGACTTCATGACCATCAATG\n" +
 53 |                 "AGTCAACTTCCTTATTCAAGTCCCTCAAACTAACAAGATCACTGTCATCTCTTTTAGACCTCCTCATCAT\n" +
 54 |                 "GCGTTGCACACTTGCAACCTTTGAAAAATCTAAGCCGGACAGAAGAGCCCTCGCGTCAGTTAGGACATCT\n" +
 55 |                 "GCCTTAACAGCAGTTGTCCAGTTCGAGAGTCCTCTCCTGAGAGACTGTGTCCATCTGAATGATGGGATTG\n" +
 56 |                 "GTTGTTCGCTCATAGTGATGAAATTGCGCAGAGTTATCCAAAAGCCTAGGATCCTCTGTGCG";
 57 | 
 58 | 
 59 |             try {
 60 | 
 61 |                 // parse the raw sequence from the string
 62 |                 InputStream stream = new ByteArrayInputStream(dnaFastaS.getBytes());
 63 | 
 64 |                 // define the Ambiguity Compound Sets
 65 |                 AmbiguityDNACompoundSet ambiguityDNACompoundSet = AmbiguityDNACompoundSet.getDNACompoundSet();
 66 |                 CompoundSet<NucleotideCompound>  nucleotideCompoundSet = AmbiguityRNACompoundSet.getRNACompoundSet();
 67 | 
 68 |                 FastaReader<DNASequence, NucleotideCompound> proxy =
 69 |                         new FastaReader<DNASequence, NucleotideCompound>(
 70 |                                 stream,
 71 |                                 new GenericFastaHeaderParser<DNASequence, NucleotideCompound>(),
 72 |                                 new DNASequenceCreator(ambiguityDNACompoundSet));
 73 | 
 74 |                 // has only one entry in this example, but could be easily extended to parse a FASTA file with multiple sequences
 75 |                 LinkedHashMap<String, DNASequence> dnaSequences = proxy.process();
 76 | 
 77 |                 // Initialize the Transcription Engine
 78 |                 TranscriptionEngine engine = new
 79 |                         TranscriptionEngine.Builder().dnaCompounds(ambiguityDNACompoundSet).rnaCompounds(nucleotideCompoundSet).build();
 80 | 
 81 |                 Frame[] sixFrames = Frame.getAllFrames();
 82 | 
 83 |                 for (DNASequence dna : dnaSequences.values()) {
 84 | 
 85 |                     Map<Frame, Sequence<AminoAcidCompound>> results = engine.multipleFrameTranslation(dna, sixFrames);
 86 | 
 87 |                     for (Frame frame : sixFrames){
 88 |                         System.out.println("Translated Frame:" + frame +" : " + results.get(frame));
 89 |                     }
 90 | 
 91 |                 }
 92 |             } catch (Exception e){
 93 |                 e.printStackTrace();
 94 |             }
 95 | ```  
 96 | 
 97 | This code will print out:
 98 | 
 99 | ```
100 | Translated Frame:ONE : RTEDPRRY*LALITDTVSYLDKDPAN*LVSMGQLVSLFSEIPSIIHEALNVALVAVSIIAILKGVVNVWKSGVLQLLAFLLLAGRSCSVIIGHHLELQHVIFNGSSITPFLPVTCKINDTYFLLRGPYEADWAVELSVTETTVLVDLEGGSSMKLKAGNISGCLGDNPHLRSVVFTLNWLLTGLDHVIDSDPKILCDLKDSGHFRLQMNLTEKHYCDKFHIKMGKVFGVFKDPCMAGGKMFAILKNTSWSNQCQGNHVSTIHLVLQSNFKQVLSSRKLLNFFSWSLSDATGADMPGGFCLEKWMLISSELKCFGNTAVAKCNLNHDSEFCDMLRLFDFNKKAIVTLQNKTKHRLDTVITAINSLISDNILMKNRIKELIDVPYCNYTKFWYVNHTGLNLHTLPRCWLVKNGSYLNVSDFRNEWILESDHLVSEILSKEYEERQNRTPLSLVDICFWSTLFYTASIFLHLLRIPTHRHIVGEGCPKPHRLNRHSICACGLFKQEGRPLRWVRKV*TMVAWWPPLLHPPRGVQQWRFSXSLENNC*IGFSKENIDW*YPFWFFIGH*P*KHSTEHQTVQKCTVCICCGWC*SFLLTFNDYSLMSVTQMFKSLPNKIFKEP*LFCGQFDINNLQILSSMHIKEPPNIISTLRSENHGFRQTPYYACDEARSCMFNILQQYPMTIFTLVLKRFYVIGFSVAVDEFL*FRLVNCKLNRVMWRAFNVNPCCVTDHQRLYTSLHLSLHLRLERGQEWGTTVPDGDWCFAW*TLKFHNCIVPCSPFDL*S**NVRPALTRETHLIKYVFIRIPLSWYGTDEHVFPLFDLQEGFYNGCECGLLNLYCFQHVATINQASTGFHSS**TKVVA*YVNLRILLSDYL*RRMLLHQSI*VT*IRVFCCV*QPQICEGLNHII*SFPSLFSHTGNCECWHRSKLVDHK**VFNVPDFDHPSSSSQPRSLNTNKLQNFGPPGELCCREVP*ILATCGCQPLCSLNFLP*FVQVTGEFH*LFL*Q*SYLKEHVLLG*LHDHQ*VNFLIQVPQTNKITVISFRPPHHALHTCNL*KI*AGQKSPRVS*DICLNSSCPVRESSPERLCPSE*WDWLFAHSDEIAQSYPKA*DPLC
101 | Translated Frame:TWO : AQRILGVTDLR**QILFHI*IKTQPTDWSAWDNLCPSSVKFHQSYTKLSMLLS*LLASLQY*KGL*MFGRVEFCSFWPSCSWREDPAQS*LVIISNCSM*SSMGHQSHPFYQLHVRSMIPTSY*EAPMKLIGQLN*V*LKPQSWLILKVAAQ*S*KPETSQVVLETTPI*DQWSSH*IGC*QD*IMLLILTRKFSVILKTVGTFVSR*T*QKSTIVTSFTSKWARSLAYSKIRAWLVVKCLPY*KIPLGRTSAKETMSAPFILSFRVISNRSSVAGNC*TFSAGHCLMPQGLICLVVFVWKNGC*FQVN*NALETQLWQSAT*IMTQSSVTCLGFLISTKRQ*SLFRTKQSIGWTQ*LLLSIH*SLIIFL*RTGLKN**MFLTVITPNFGMSITQV*ICTPFQDVGLLKMVAT*MCLTSGMSGYWRVIILFRRSFQRSMRKGKIVHHSHWLTSVSGVHCFTQHQFSYTS*EFQPTDTLLVRAARSLIG*TGTQYVLVAFSNKKADP*DG*ERCEQWLLGGLHCCTPLGGCSNGGSX*A*RTTVESGSLKRTSIGSTLFGFSLVTDPESTALNIKQSKSAQCAFVVAGADPFFLLLMTIPLCLSHRCSNLFQTRSSKSRDCSAVSLTSTIFKSCLPCISKSLLISSAPCAVKTMDLGRLLIMLVMRPGRACSTSFSNIP*QYLLWSLKDFMSLGFL*QWMNFCDSGW*IANSTGSCGGPSMSIHVVSLTINDSTLLFT*ASTSGLSVDKSGAPPFRMGTGVLLGKLSNSTTVLSHALPLICDLDEM*GQPSPERHTL*SMFS*GFLCPGMALMNMFSLFLISKRVFIMVVNVDSSIFIVSSMLPPSIRQAPAFTAADKLRL*PDMLI*ESS*VITFREGCFSIKASKSLKLGYFAVYSNPRSVRA*TTSFRVSPPCSVIQEIVSAGIDPNWLIISDESLTSQTLTTPPVLANPGL*IPTSCRISDLLVSCVVERFPRYWPPVAVNLSVL*TFCLNLSKSLESSISSSFDNDPILRNMFFWVDFMTINESTSLFKSLKLTRSLSSLLDLLIMRCTLATFEKSKPDRRALASVRTSALTAVVQFESPLLRDCVHLNDGIGCSLIVMKLRRVIQKPRILCA
102 | Translated Frame:THREE : HRGS*ALLTCANNRYCFIFR*RPSQLIGQHGTTCVPLQ*NSINHTRSSQCCSRSC*HHCNIERGCECLEEWSFAAFGLLAPGGKILLSHNWSSSRTAACDLQWVINHTLFTSYM*DQ*YLLPTKRPL*S*LGS*IECN*NHSLG*S*RWQLNEAESRKHLRLSWRQPPSEISGLHIELVANRIRSCY*F*PENSL*S*RQWALSSPDELNRKALL*QVSHQNGQGLWRIQRSVHGWW*NVCHTKKYLLVEPVPRKPCQHHSSCPSE*FQTGPQ*QETVELFQLVIV*CHRG*YAWWFLSGKMDVDFK*TEMLWKHSCGKVQLKS*LRVL*HA*AF*FQQKGNSHSSEQNKASAGHSNYCYQFIDL**YSYEEQD*RIDRCSLL*LHQILVCQSHRSKSAHPSKMLAC*KW*LLECV*LQE*VDIGE*SSCFGDPFKGV*GKAKSYTTLTG*HLFLEYIVLHSINFPTPLENSNPQTHCW*GLPEAS*AKQALNMCLWPFQTRRQTLEMGKKGVNNGCLVASIAAPP*GGAAMEVLXEPREQLLNRVL*REHRLVVPFLVFHWSLTLKAQH*TSNSPKVHSVHLLWLVLILSSYF**LFPYVCHTDVQISSKQDLQRAVTVLRSV*HQQSSNPVFHAYQRAS*YHQHLAQ*KPWI*ADSLLCL**GQVVHVQHPSAISHDNIYFGP*KILCHWVFCSSG*IFVIQAGKLQTQQGHVAGLQCQSMLCH*PSTTLHFSSPEPPPQA*AWTRVGHHRSGWGLVFCLVNSQIPQLYCPMLSL*SVILMKCKASPHQRDTPYKVCFHKDSSVLVWH**TCFPSF*SPRGFL*WL*MWTPQSLLFPACCHHQSGKHRLSQQLIN*GCSLIC*FENPPE*LPLEKDASPSKHLSHLN*GILLCIATPDL*GPEPHHLEFPLPVQSYRKL*VLASIQIG*S*VMSL*RPRL*PPLQF*PTQVSEYQQVAEFRTSW*AVL*RGSLDTGHLWLSTSLFFELFALICPSHWRVPLALPLTMILS*GTCSFGLTS*PSMSQLPYSSPSN*QDHCHLF*TSSSCVAHLQPLKNLSRTEEPSRQLGHLP*QQLSSSRVLS*ETVSI*MMGLVVRS***NCAELSKSLGSSV
103 | Translated Frame:REVERSED_ONE : RTEDPRLLDNSAQFHHYERTTNPIIQMDTVSQERTLELDNCC*GRCPN*REGSSVRLRFFKGCKCATHDEEV*KR*Q*SC*FEGLE*GS*LIDGHEVNPKEHVP*DRIIVKGRANGTLQ*LGQIKAKSSKNREVDSHRWPVSREPLYNTAHQEVRNSATCWYSETWVG*NWRGGQSLGR*RLITYDQPIWIDASTHNFLYD*TGRGNSK*CGSGPHRSGVAIHSKIP*FK*LRCFDGEASFSKGNHSGGFSN*HIRLQP*FISCCESRCLPD*WWQHAGNNKD*GVHIHNHYKNPLGDQKEGKHVHQCHTRTEESL*KHTL*GVSLW*GLALHFIKITDQRESMGQYSCGI*EFTKQNTSPHPERWCPTLVHAQA*GGGSGEEKCRVVDGQ*HNMD*H*RPAT*PC*VCNLPA*ITKIHPLLQKTQ*HKIF*GPK*ILSWDIAEGC*TCTTWPHHKHNKESA*IHGFHCARC**Y*EAL*YAWKTGFEDC*CQTDRRTVTAL*RSCLEEI*TSV*QT*GNSH*K*EERISTSHNKCTLCTFGLFDVQCCAFRVSDQ*KTKKGTTNRCSL*RTRFNSCSLGSXRTSIAAPP*GGAAMEATKQPLFTPFLPISRVCLLV*KGHKHILSACLAYEASGSPHQQCVCGLEFSRGVGKLMLCKTMYSRNRCQPVRVVYDFAFPHTPLKGSPKQDDHSPISTHS*SQTHSSSYHF*QANILEGCADLDLCD*HTKIWCNYSKEHLSIL*SCSS*EYYQRSMN**Q*LLCPADALFCSEE*LLPFC*NQKA*ACHRTLSHDLSCTLPQLCFQSISVHLKSTSIFPDKNHQAYQPLWHQTMTS*KSSTVSCY*GPV*NYSEGQDEWC*HGFLGTGSTKRYFLVWQTFYHQPCTDL*IRQRPCPF*CETCHNSAFLLSSSGDESAHCL*DHREFSGQNQ*HDLILLATNSM*RPLISDGGCLQDNLRCFRLSASLSCHLQDQPRLWFQLHSIQLPNQLHRGLLVGSRYH*SYM*LVKRV*LMTH*RSHAAVRDDDQL*LSRIFPPGARRPKAAKLHSSKHSQPLSILQ*C*QLREQH*ELRV*LMEFH*RGTQVVPC*PISWLGLYLNMKQYLLLAQVSNA*DPLC
104 | Translated Frame:REVERSED_TWO : AQRILGFWITLRNFITMSEQPIPSFRWTQSLRRGLSNWTTAVKADVLTDARALLSGLDFSKVASVQRMMRRSKRDDSDLVSLRDLNKEVDSLMVMKSTQKNMFLKIGSLSKEELMELSSDLDKLRQKVQRTERLTATGGQYLGNLSTTQLTRRSEILQLVGIQRPGLARTGGVVKVWDVKDSSLMINQFGSMPALTISCMTEQGGETLNDVVQALTDLGLLYTAKYPNLSDLDALMEKHPSLKVITQEDSQINISGYNLSLSAAVKAGACLIDGGNMLETIKIEESTFTTIIKTLLEIKKRENMFISAIPGQRNPYENILYKVCLSGEGWPYISSRSQIKGRAWDNTVVEFESLPSKTPVPIRNGGAPLLSTLKPEVEAQVKRSVESLMVSDTTWIDIEGPPHDPVEFAIYQPESQKFIHCYRKPNDIKSFKDQSKYCHGILLKDVEHARPGLITSIIRSLPKSMVFTAQGADDIRRLFDMHGRQDLKIVDVKLTAEQSRLFEDLVWKRFEHLCDRHKGIVIKSKKKGSAPATTNAHCALLDCLMFSAVLSGSVTNEKPKRVLPIDVLFREPDSTVVL*AXREPPLLHPPRGVQQWRPPSNHCSHLSYPSQGSAFLFEKATSTY*VPV*PMRLRAALTNNVSVGWNSQEV*EN*CCVKQCTPETDVNQ*EWCTILPFLILL*KDLRNKMITLQYPLIPEVRHIQVATIFNKPTSWKGVQI*TCVIDIPKFGVITVRNIYQFFNPVLHKNIIRDQ*IDSSNYCVQPMLCFVLKSDYCLFVEIKKPKHVTEL*VMI*VALCHSCVSKAFQFT*NQHPFFQTKTTRHISPCGIRQ*PAEKVQQFPATEDLFEITLKDKMNGADMVSLALVRPRGIF*YGKHFTTSHARIFEYAKDLAHFDVKLVTIVLFC*VHLETKVPTVFKITENFRVRINNMI*SC*QPIQCEDH*SQMGVVSKTT*DVSGFQLH*AATFKINQDCGFSYTQFNCPISFIGAS**EVGIIDLTCNW*KGCD**PIEDHMLQFEMMTNYD*AGSSRQEQEGQKLQNSTLPNIHNPFQYCNDANSYESNIESFVYD*WNFTEEGHKLSHADQSVGWVFI*I*NSICY*RKSVTPRILCA
105 | Translated Frame:REVERSED_THREE : HRGS*AFG*LCAISSL*ANNQSHHSDGHSLSGEDSRTGQLLLRQMS*LTRGLFCPA*IFQRLQVCNA**GGLKEMTVILLV*GT*IRKLTH*WS*SQPKRTCSLR*DHCQRKS*WNSPVTWTN*GKKFKEQRG*QPQVASI*GTSLQHSSPGGPKFCNLLVFRDLGWLELEGWSKSGTLKTHHL*STNLDRCQHSQFPV*LNREGKL*MMWFRPSQIWGCYTQQNTLI*VT*ML*WRSILL*R*SLRRILKLTYQATTLVYQLL*KPVLA*LMVATCWKQ*RLRSPHSQPL*KPSWRSKRGKTCSSVPYQDRGILMKTYFIRCVSLVRAGLTFHQDHRSKGEHGTIQLWNLRVYQAKHQSPSGTVVPHSCPRSSLRWRLR*REV*SR*WSVTQHGLTLKARHMTLLSLQFTSLNHKNSSTATENPMT*NLLRTKVNIVMGYC*RMLNMHDLASSQA**GVCLNPWFSLRKVLMILGGSLICMEDRI*RLLMSN*PQNSHGSLKILFGRDLNICVTDIRE*SLKVRRKDQHQPQQMHTVHFWTV*CSVLCFQGQ*PMKNQKGYYQSMFSLENPIQQLFSRLXENLHCCTPLGGCSNGGHQATIVHTFLTHLKGLPSCLKRPQAHIECLFSL*GFGQPSPTMCLWVGILKRCRKIDAV*NNVLQKQMSTSESGVRFCLSSYSFERISETR*SLSNIHSFLKSDTFK*LPFLTSQHLGRVCRFRPV*LTYQNLV*LQ*GTSINSLILFFIRILSEINELIAVITVSSRCFVLF*RVTIAFLLKSKSLSMSQNSES*FKLHFATAVFPKHFSSLEINIHFSRQKPPGISAPVASDNDQLKKFNSFLLLRTCLKLL*RTR*MVLTWFPWHWFDQEVFFSMANILPPAMHGSLNTPKTLPILM*NLSQ*CFSVKFIWRRKCPLSLRSQRIFGSESIT*SNPVSNQFNVKTTDLRWGLSPRQPEMFPAFSFIELPPSRSTKTVVSVTLNSTAQSAS*GPLSRK*VSLILHVTGKKGVIDDPLKITCCSSR**PIMTEQDLPARSKKAKSCKTPLFQTFTTPFNIAMMLTATRATLRASCMIDGISLKRDTSCPMLTNQLAGSLSKYETVSVISASQ*RLGSSV
106 | ``` 
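
To make the six frame names in the output above more concrete, here is a minimal plain-Java sketch of the idea: three frames read the sequence at offsets 0, 1 and 2, and the three reversed frames do the same on the reverse complement. This is not BioJava's implementation; the class `SixFrameSketch` and its methods are hypothetical names used only for illustration, and the sketch assumes the standard genetic code and maps any codon containing an ambiguous base to `X`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of six-frame translation; NOT BioJava's implementation.
class SixFrameSketch {

    // Standard genetic code, codons enumerated with bases in the order T, C, A, G.
    private static final String BASES = "TCAG";
    private static final String AMINO_ACIDS =
            "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    // Reverse-complements a DNA string; symbols other than A, C, G, T become N.
    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            switch (dna.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                default:  sb.append('N');
            }
        }
        return sb.toString();
    }

    // Translates codon by codon starting at the given offset (0, 1 or 2).
    // '*' marks a stop codon, 'X' a codon with a base other than A, C, G, T.
    static String translate(String dna, int offset) {
        StringBuilder protein = new StringBuilder();
        for (int i = offset; i + 3 <= dna.length(); i += 3) {
            int b1 = BASES.indexOf(dna.charAt(i));
            int b2 = BASES.indexOf(dna.charAt(i + 1));
            int b3 = BASES.indexOf(dna.charAt(i + 2));
            if (b1 < 0 || b2 < 0 || b3 < 0) {
                protein.append('X');
            } else {
                protein.append(AMINO_ACIDS.charAt(16 * b1 + 4 * b2 + b3));
            }
        }
        return protein.toString();
    }

    // Returns all six translations, keyed by the frame names used in the output above.
    static Map<String, String> sixFrames(String dna) {
        String[] names = {"ONE", "TWO", "THREE"};
        Map<String, String> frames = new LinkedHashMap<>();
        for (int offset = 0; offset < 3; offset++) {
            frames.put(names[offset], translate(dna, offset));
        }
        String reverse = reverseComplement(dna);
        for (int offset = 0; offset < 3; offset++) {
            frames.put("REVERSED_" + names[offset], translate(reverse, offset));
        }
        return frames;
    }

    public static void main(String[] args) {
        // First 18 bases of the example sequence.
        sixFrames("CGCACAGAGGATCCTAGG")
                .forEach((frame, protein) -> System.out.println(frame + " : " + protein));
    }
}
```

For the first 18 bases of the sequence above, `translate("CGCACAGAGGATCCTAGG", 0)` gives `RTEDPR`, which matches the start of frame ONE in the printed output.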
107 | <!--automatically generated footer-->
108 | 
109 | ---
110 | 
111 | Navigation:
112 | [Home](../README.md)
113 | | [Book 1: The Core Module](README.md)
114 | | Chapter 4 : Translating
115 | 
116 | Prev: [Chapter 3 : Reading and Writing sequences](readwrite.md)
117 | 


--------------------------------------------------------------------------------
/genomics/README.md:
--------------------------------------------------------------------------------
 1 | The BioJava - Genomics Module
 2 | =====================================================
 3 | 
 4 | A tutorial for the genomics module of [BioJava](http://www.biojava.org)
 5 | 
 6 | ## About
 7 | <table>
 8 |     <tr>
 9 |         <td>
10 |             <img src="img/genomics.png"/>
11 |         </td>
12 |         <td>
13 |             The <i>genome</i> module of BioJava provides an API that allows you to:
14 |             <ul>
15 |                 <li>Parse popular file formats used in genomics</li>
16 |                 <li>Convert from one file format to another</li>
17 |                 <li>Translate DNA sequences into protein sequences</li>                
18 |             </ul>
19 |         </td>
20 |     </tr>
21 | </table>   
22 | 
23 | ## Index
24 | 
25 | This tutorial is split into several chapters.
26 | 
27 | Chapter 1 - Quick [Installation](installation.md)
28 | 
29 | Chapter 2 - Reading [gene names information](genenames.md) from genenames.org
30 | 
31 | Chapter 3 - Reading [chromosomal positions](chromosomeposition.md) for genes (UCSC's refFlat.txt.gz)
32 | 
33 | Chapter 4 - Reading [GTF and GFF files](gff.md)
34 | 
35 | Chapter 5 - Reading and writing a [Genebank](genebank.md) file
36 | 
37 | Chapter 5 - Reading [karyotype (cytoband)](karyotype.md) files
38 | 
39 | Chapter 6 - Reading genomic DNA sequences using UCSC's [.2bit file format](twobit.md)
40 | 
41 | 
42 | ## License
43 | 
44 | The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
45 | 
46 | ## Please Cite
47 | 
48 | **BioJava 5: A community driven open-source bioinformatics library**<br/>
49 | *Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte* <br/>
50 | [PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791) <br/>
51 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
52 | 
53 | 
54 | 
55 | <!--automatically generated footer-->
56 | 
57 | ---
58 | 
59 | Navigation:
60 | [Home](../README.md)
61 | | Book 4: The Genomics Module
62 | 
63 | Prev: [Book 3: The Structure Modules](../structure/README.md)
64 | 


--------------------------------------------------------------------------------
/genomics/chromosomeposition.md:
--------------------------------------------------------------------------------
 1 | Parse Chromosomal Information of Genes
 2 | ======================================
 3 | 
 4 | BioJava contains a parser for the [refFlat.txt.gz](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz) file
 5 | from the UCSC Genome Browser, which maps gene names to chromosome positions.
 6 | 
 7 | 
 8 | ```java
 9 | 	try {
10 | 
11 | 			List<GeneChromosomePosition> genePositions=	GeneChromosomePositionParser.getChromosomeMappings();
12 | 			System.out.println("got " + genePositions.size() + " gene positions") ;
13 | 
14 | 			for (GeneChromosomePosition pos : genePositions){
15 | 				if ( pos.getGeneName().equals("FOLH1")) {
16 | 					System.out.println(pos);
17 | 					break;
18 | 				}
19 | 			}
20 | 
21 | 		} catch(Exception e){
22 | 			e.printStackTrace();
23 | 		}
24 | ```
25 | 
26 | If a local copy of the file is available, it can be provided as an input stream:
27 | 
28 | 
29 | ```java
30 | 
31 |         URL url = new URL("file:///local/copy/of/file");
32 | 
33 | 		InputStreamProvider prov = new InputStreamProvider();
34 | 
35 | 		InputStream inStream = prov.getInputStream(url);
36 | 
37 | 		List<GeneChromosomePosition> genePositions = GeneChromosomePositionParser.getChromosomeMappings(inStream);
38 | 
39 | 
40 | 
41 | ```
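
A `file:` URL for an absolute path has an empty host part, so the path starts after the third slash. With only two slashes, the first path segment is read as a host name instead. The following sketch uses only the Java standard library to build such a URL from a path without writing the slashes by hand; the class name `FileUrlSketch` and the example path are placeholders, not part of BioJava.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;

// Standard-library sketch: build a file URL from a local path.
class FileUrlSketch {

    // Converts a local path (relative or absolute) into a file URL.
    static URL toFileUrl(String localPath) throws MalformedURLException {
        Path path = Paths.get(localPath).toAbsolutePath();
        return path.toUri().toURL();
    }

    public static void main(String[] args) throws MalformedURLException {
        // Placeholder path; replace with the location of your local copy.
        System.out.println(toFileUrl("/local/copy/of/file"));

        // With only two slashes, "local" is parsed as the host, not as part of the path.
        System.out.println(URI.create("file://local/copy/of/file").getHost());
    }
}
```

The resulting `URL` can be passed to `InputStreamProvider.getInputStream(url)` exactly as in the example above.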
42 | <!--automatically generated footer-->
43 | 
44 | ---
45 | 
46 | Navigation:
47 | [Home](../README.md)
48 | | [Book 4: The Genomics Module](README.md)
49 | | Chapter 3 : chromosomal positions
50 | 
51 | Prev: [Chapter 2 : gene names information](genenames.md)
52 | 
53 | Next: [Chapter 4 : GTF and GFF files](gff.md)
54 | 


--------------------------------------------------------------------------------
/genomics/genebank.md:
--------------------------------------------------------------------------------
  1 | Reading and writing a Genbank file
  2 | ==================================
  3 | 
  4 | There are several ways to read a Genbank file.
  5 | 
  6 | ## Method 1: Read a Genbank file using the GenbankProxySequenceReader
  7 | 
  8 | ```java
  9 | GenbankProxySequenceReader<AminoAcidCompound> genbankProteinReader =
 10 |     new GenbankProxySequenceReader<AminoAcidCompound>("/tmp", "NP_000257",
 11 |         AminoAcidCompoundSet.getAminoAcidCompoundSet());
 12 | ProteinSequence proteinSequence = new ProteinSequence(genbankProteinReader);
 13 | genbankProteinReader.getHeaderParser().parseHeader(
 14 |     genbankProteinReader.getHeader(), proteinSequence);
 15 | System.out.format("Sequence(%s,%d)=%s...",
 16 |     proteinSequence.getAccession(),
 17 |     proteinSequence.getLength(),
 18 |     proteinSequence.getSequenceAsString().substring(0, 10));
 19 | 
 20 | GenbankProxySequenceReader<NucleotideCompound> genbankDNAReader =
 21 |     new GenbankProxySequenceReader<NucleotideCompound>("/tmp", "NM_001126",
 22 |         DNACompoundSet.getDNACompoundSet());
 23 | DNASequence dnaSequence = new DNASequence(genbankDNAReader);
 24 | genbankDNAReader.getHeaderParser().parseHeader(genbankDNAReader.getHeader(), dnaSequence);
 25 | System.out.format("Sequence(%s,%d)=%s...", dnaSequence.getAccession(),
 26 |     dnaSequence.getLength(),
 27 |     dnaSequence.getSequenceAsString().substring(0, 10));
 28 | ```
 29 | 
 30 | 
 31 | ## Method 2: Read a Genbank file using GenbankReaderHelper
 32 | 
 33 | ```java
 34 | File dnaFile = new File("src/test/resources/NM_000266.gb");
 35 | File protFile = new File("src/test/resources/BondFeature.gb");
 36 | 
 37 | LinkedHashMap<String, DNASequence> dnaSequences =
 38 |     GenbankReaderHelper.readGenbankDNASequence( dnaFile );
 39 | for (DNASequence sequence : dnaSequences.values()) {
 40 |   System.out.println( sequence.getSequenceAsString() );
 41 | }
 42 | 
 43 | LinkedHashMap<String, ProteinSequence> protSequences = GenbankReaderHelper.readGenbankProteinSequence(protFile);
 44 | for (ProteinSequence sequence : protSequences.values()) {
 45 |   System.out.println( sequence.getSequenceAsString() );
 46 | }
 47 | ```
 48 | 
 49 | ## Method 3: Read a Genbank file using the GenbankReader Object
 50 | 
 51 | ```java
 52 | FileInputStream is = new FileInputStream(dnaFile);
 53 | GenbankReader<DNASequence, NucleotideCompound> dnaReader =
 54 |     new GenbankReader<DNASequence, NucleotideCompound>(
 55 |         is,
 56 |         new GenericGenbankHeaderParser<DNASequence,NucleotideCompound>(),
 57 |         new DNASequenceCreator(DNACompoundSet.getDNACompoundSet())
 58 |         );
 59 | dnaSequences = dnaReader.process();
 60 | is.close();
 61 | System.out.println(dnaSequences);
 62 | 
 63 | is = new FileInputStream(protFile);
 64 | GenbankReader<ProteinSequence, AminoAcidCompound> protReader =
 65 |     new GenbankReader<ProteinSequence, AminoAcidCompound>(
 66 |         is,
 67 |         new GenericGenbankHeaderParser<ProteinSequence,AminoAcidCompound>(),
 68 |         new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet())
 69 |         );
 70 | protSequences = protReader.process();
 71 | is.close();
 72 | System.out.println(protSequences);
 73 | ```
 74 | 
 75 | 
 76 | ## Write a Genbank file
 77 | 
 78 | 
 79 | Use the GenbankWriterHelper to write DNA sequences into a Genbank file.
 80 | 
 81 | ```java
 82 | // First let's read some DNA sequences from a genbank file
 83 | 
 84 | File dnaFile = new File("src/test/resources/NM_000266.gb");
 85 | LinkedHashMap<String, DNASequence> dnaSequences =
 86 |     GenbankReaderHelper.readGenbankDNASequence( dnaFile );
 87 | ByteArrayOutputStream fragwriter = new ByteArrayOutputStream();
 88 | ArrayList<DNASequence> seqs = new ArrayList<DNASequence>();
 89 | for(DNASequence seq : dnaSequences.values()) {
 90 |   seqs.add(seq);
 91 | }
 92 | 
 93 | // Now that we have some DNA sequence data, the next step is to write it out
 94 | 
 95 | GenbankWriterHelper.writeNucleotideSequence(fragwriter, seqs,
 96 |     GenbankWriterHelper.LINEAR_DNA);
 97 | 
 98 | // the fragwriter object now contains a string representation in the Genbank format
 99 | // and you could write this into a file
100 | // or print it out on the console
101 | System.out.println(fragwriter.toString());
102 | ```
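
The comment in the example above mentions that the buffer could also be written to a file. A minimal standard-library sketch of that step follows. It does not use any BioJava API, the class name `BufferToFileSketch` is hypothetical, and the buffer content is a placeholder standing in for the Genbank text produced by `GenbankWriterHelper`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Standard-library sketch: copy an in-memory buffer to a file.
class BufferToFileSketch {

    // Writes the full content of the buffer to the target file, replacing existing content.
    static void writeToFile(ByteArrayOutputStream buffer, Path target) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            buffer.writeTo(out);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder content; in the example above this would be the filled fragwriter.
        ByteArrayOutputStream fragwriter = new ByteArrayOutputStream();
        byte[] placeholder = "placeholder for Genbank text\n".getBytes(StandardCharsets.US_ASCII);
        fragwriter.write(placeholder, 0, placeholder.length);

        Path target = Files.createTempFile("sequences", ".gb");
        writeToFile(fragwriter, target);
        System.out.println("wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

Alternatively, since `GenbankWriterHelper.writeNucleotideSequence` in the example above takes an output stream as its first argument, a file-backed stream could presumably be passed in place of the in-memory buffer.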
103 | 
104 | <!--automatically generated footer-->
105 | 
106 | ---
107 | 
108 | Navigation:
109 | [Home](../README.md)
110 | | [Book 4: The Genomics Module](README.md)
111 | | Chapter 5 : Genebank
112 | 
113 | Prev: [Chapter 4 : GTF and GFF files](gff.md)
114 | 
115 | Next: [Chapter 5 : karyotype (cytoband)](karyotype.md)
116 | 


--------------------------------------------------------------------------------
/genomics/genenames.md:
--------------------------------------------------------------------------------
 1 | Parse Gene Name Information
 2 | ===========================
 3 | 
 4 | The following code parses [a file from the www.genenames.org](http://www.genenames.org/cgi-bin/download?title=HGNC+output+data&hgnc_dbtag=on&col=gd_app_sym&col=gd_app_name&col=gd_status&col=gd_prev_sym&col=gd_prev_name&col=gd_aliases&col=gd_pub_chrom_map&col=gd_pub_acc_ids&col=md_mim_id&col=gd_pub_refseq_ids&col=md_ensembl_id&col=md_prot_id&col=gd_hgnc_id&status=Approved&status_opt=2&where=((gd_pub_chrom_map%20not%20like%20%27%patch%%27%20and%20gd_pub_chrom_map%20not%20like%20%27%ALT_REF%%27)%20or%20gd_pub_chrom_map%20IS%20NULL)%20and%20gd_locus_group%20%3d%20%27protein-coding%20gene%27&order_by=gd_app_sym_sort&format=text&limit=&submit=submit&.cgifields=&.cgifields=chr&.cgifields=status&.cgifields=hgnc_dbtag)
 5 | website that contains a mapping of human gene names to other databases.
 6 | 
 7 | 
 8 | ```java
 9 |     /** parses a file from the genenames website
10 | 	 *
11 | 	 * @param args
12 | 	 */
13 | 	public static void main(String[] args) {
14 | 
15 | 		try {
16 | 
17 | 			List<GeneName> geneNames = GeneNamesParser.getGeneNames();
18 | 
19 | 			System.out.println("got " + geneNames.size() + " gene names");
20 | 
21 | 
22 | 			for ( GeneName g : geneNames){
23 | 				if ( g.getApprovedSymbol().equals("FOLH1"))
24 | 					System.out.println(g);
25 | 			}
26 | 			// getGeneNames() returns a list of beans that contain key-value pairs for each gene name
27 | 
28 | 		} catch (Exception e) {
29 | 			// TODO Auto-generated catch block
30 | 			e.printStackTrace();
31 | 		}
32 | 
33 | 	}
34 | ```
35 | 
36 | If you have a local copy of this file, then you can just provide an input stream for it:
37 | 
38 | ```java
39 | 
40 |         URL url = new URL("file:///local/copy/of/file");
41 | 
42 | 		InputStreamProvider prov = new InputStreamProvider();
43 | 
44 | 		InputStream inStream = prov.getInputStream(url);
45 | 
46 | 		List<GeneName> geneNames = GeneNamesParser.getGeneNames(inStream);
47 | 
48 | 
49 | ```
50 | <!--automatically generated footer-->
51 | 
52 | ---
53 | 
54 | Navigation:
55 | [Home](../README.md)
56 | | [Book 4: The Genomics Module](README.md)
57 | | Chapter 2 : gene names information
58 | 
59 | Prev: [Chapter 1 : Installation](installation.md)
60 | 
61 | Next: [Chapter 3 : chromosomal positions](chromosomeposition.md)
62 | 


--------------------------------------------------------------------------------
/genomics/gff.md:
--------------------------------------------------------------------------------
 1 | Reading GFF files
 2 | =================
 3 | 
 4 | The biojava3-genome library leverages the sequence relationships in biojava3-core to read gtf, gff2 and gff3 files and
 5 | to write gff3 files. The gtf, gff2 and gff3 file formats are well defined, but what gets written into a file is very
 6 | flexible. We currently support reading gff files generated by the open source gene prediction applications
 7 | GeneID, GeneMark and GlimmerHMM. Each prediction algorithm uses a different ontology to describe coding sequences,
 8 | exons, and start or stop codons, which makes it difficult to write a general-purpose gff parser that creates biologically
 9 | meaningful objects. If the application simply loads a gff file and draws a colored glyph, you don't need to
10 | worry about the ontology used. Otherwise, it is easier to support the popular gene prediction algorithms by writing a parser that
11 | is aware of each gene prediction application's ontology.
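
To illustrate where the ontology problem sits, recall that gtf, gff2 and gff3 lines all share nine tab-separated columns, and that the feature type in the third column is the part each prediction program fills in with its own vocabulary. The following plain-Java sketch splits one such line into its columns. It is not the BioJava parser; the class `GffLineSketch` and the example line in `main` are made up for illustration.

```java
// Plain-Java sketch of the nine shared gtf/gff columns; NOT the BioJava parser.
class GffLineSketch {
    final String seqId;       // column 1: sequence or scaffold name
    final String source;      // column 2: program that produced the feature
    final String type;        // column 3: feature type, vocabulary depends on the program
    final int start;          // column 4: 1-based start position
    final int end;            // column 5: 1-based end position (inclusive)
    final String score;       // column 6: score, or "." if absent
    final char strand;        // column 7: '+', '-' or '.'
    final String phase;       // column 8: 0, 1, 2, or "." if not applicable
    final String attributes;  // column 9: attribute syntax differs between gtf, gff2 and gff3

    private GffLineSketch(String[] f) {
        seqId = f[0];
        source = f[1];
        type = f[2];
        start = Integer.parseInt(f[3]);
        end = Integer.parseInt(f[4]);
        score = f[5];
        strand = f[6].charAt(0);
        phase = f[7];
        attributes = f[8];
    }

    // Splits one feature line into its nine columns.
    static GffLineSketch parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != 9) {
            throw new IllegalArgumentException("expected 9 columns but found " + fields.length);
        }
        return new GffLineSketch(fields);
    }

    public static void main(String[] args) {
        // Made-up example line, not taken from a real prediction run.
        GffLineSketch feature = parse("scaffold1\tExamplePredictor\tCDS\t100\t400\t.\t+\t0\tID=cds1");
        System.out.println(feature.type + " on " + feature.seqId + " strand " + feature.strand
                + " from " + feature.start + " to " + feature.end);
    }
}
```

A parser that is aware of a given prediction program would then decide what the value in `type` means for that program before building gene, transcript and exon objects.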
12 | 
13 | 
14 | The following code example takes a 454scaffold file that GeneMark used to predict genes and returns a
15 | collection of ChromosomeSequences. Each chromosome sequence maps to a named entry in the fasta file and
16 | contains N gene sequences. The gene sequences can be on the + or - strand, with frame shifts and multiple transcripts.
17 | 
18 | Passing the collection of ChromosomeSequences to GeneFeatureHelper.getProteinSequences returns all protein
19 | sequences. You can then write the protein sequences to a fasta file.
20 | 
21 | ```java
22 | 
23 |     LinkedHashMap<String, ChromosomeSequence> chromosomeSequenceList = GeneFeatureHelper.loadFastaAddGeneFeaturesFromGeneMarkGTF(new File("454Scaffolds.fna"), new File("genemark_hmm.gtf"));
24 |     LinkedHashMap<String, ProteinSequence> proteinSequenceList = GeneFeatureHelper.getProteinSequences(chromosomeSequenceList.values());
25 |     FastaWriterHelper.writeProteinSequence(new File("genemark_proteins.faa"), proteinSequenceList.values());
26 | ```
27 | 
28 | You can also output the gene sequences to a fasta file in which the coding regions are upper case and the non-coding regions are lower case.
29 | 
30 | ```java
31 |     LinkedHashMap<String, GeneSequence> geneSequenceHashMap = GeneFeatureHelper.getGeneSequences(chromosomeSequenceList.values());
32 |     Collection<GeneSequence> geneSequences = geneSequenceHashMap.values();
33 |     FastaWriterHelper.writeGeneSequence(new File("genemark_genes.fna"), geneSequences, true);
34 | 
35 | ```
36 | 
37 | You can easily write out a gff3 view of a ChromosomeSequence with the following code.
38 | 
39 | ```java
40 |     FileOutputStream fo = new FileOutputStream("genemark.gff3");
41 |     GFF3Writer gff3Writer = new GFF3Writer();
42 |     gff3Writer.write(fo, chromosomeSequenceList);
43 |     fo.close();
44 | ```
45 | 
46 | The chromosome sequence becomes the middle layer that represents the essence of what is mapped in a gtf, gff2 or
47 | gff3 file. This makes it fairly easy to write code to convert from gtf to gff3 or from gff2 to gtf. The challenge
48 | is picking the correct ontology for writing into gtf or gff2 formats. You could use feature names used by a
49 | specific gene prediction program or features supported by your favorite genome browser. We would like to provide a
50 | complete set of java classes to do these conversions where the list of supported gene prediction applications and
51 | genome browsers will get longer based on end user requests.
52 | 
53 | 
54 | <!--automatically generated footer-->
55 | 
56 | ---
57 | 
58 | Navigation:
59 | [Home](../README.md)
60 | | [Book 4: The Genomics Module](README.md)
61 | | Chapter 4 : GTF and GFF files
62 | 
63 | Prev: [Chapter 3 : chromosomal positions](chromosomeposition.md)
64 | 
65 | Next: [Chapter 5 : Genebank](genebank.md)
66 | 


--------------------------------------------------------------------------------
/genomics/img/genomics.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/genomics/img/genomics.png


--------------------------------------------------------------------------------
/genomics/installation.md:
--------------------------------------------------------------------------------
 1 | ## Quick Installation
 2 | 
 3 | To begin, here is a quick overview of how to get access to BioJava.
 4 | 
 5 | BioJava is open source and you can get the code from [GitHub](https://github.com/biojava/biojava). However, it is usually easier to obtain it through Maven:
 6 | 
 7 | BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html)  guide.
 8 | 
 9 | Currently, we provide a BioJava-specific Maven repository at <http://biojava.org/download/maven/>.
10 | 
11 | You can add the BioJava repository by adding the following XML to your project pom.xml file:
12 | 
13 | ```xml
14 |         <repositories>
15 |             ...
16 |             <repository>
17 |                 <id>biojava-maven-repo</id>
18 |                 <name>BioJava repository</name>
19 |                 <url>http://www.biojava.org/download/maven/</url>           
20 |             </repository>
21 |         </repositories>
22 | ```
23 | 
24 | We are currently in the process of moving our distribution to Maven Central, which would make this configuration step unnecessary. In either case, declare the genomics module as a dependency:
25 | 
26 | ```xml
27 |         <dependencies>
28 |                 ...
29 | 
30 |                  <!-- This imports version 3.0.8 of the BioJava genomics module -->
31 |                 <dependency>
32 | 
33 |                         <groupId>org.biojava</groupId>
34 |                         <artifactId>biojava3-genomics</artifactId>
35 |                         <version>3.0.8</version>
36 |                         <!-- note: the genomics module depends on the BioJava-core module and will import it automatically -->
37 |                 </dependency>
38 | 
39 | 
40 |                 <!-- other biojava jars as needed -->
41 | 
42 |         </dependencies> 
43 | ```
44 | 
45 | If you run 
46 | 
47 | <pre>
48 |     mvn package
49 | </pre>
50 | 
51 |  on your project, the BioJava dependencies will be automatically downloaded and installed for you.
52 | 
53 | 
54 | <!--automatically generated footer-->
55 | 
56 | ---
57 | 
58 | Navigation:
59 | [Home](../README.md)
60 | | [Book 4: The Genomics Module](README.md)
61 | | Chapter 1 : Installation
62 | 
63 | Next: [Chapter 2 : gene names information](genenames.md)
64 | 


--------------------------------------------------------------------------------
/genomics/karyotype.md:
--------------------------------------------------------------------------------
 1 | Parsing a karyotype file from the UCSC genome browser
 2 | =====================================================
 3 | 
 4 | Karyotype information for the human genome can be read from UCSC's [cytoBand.txt.gz](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz)
 5 | file.
 6 | 
 7 | ```java
 8 | 		// download and parse the cytoband file directly from UCSC
 9 | 		CytobandParser me = new CytobandParser();
10 | 		try {
11 | 			SortedSet<Cytoband> cytobands = me.getAllCytobands(new URL("http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz"));
12 | 			// collect the distinct stain types seen across all bands
13 | 			SortedSet<StainType> types = new TreeSet<StainType>();
14 | 			for (Cytoband c : cytobands){
15 | 				System.out.println(c);
16 | 				types.add(c.getType()); // a set silently ignores duplicates
17 | 			}
18 | 			System.out.println(types);
19 | 		} catch (Exception e) {
20 | 			// report download or parse failures
21 | 			e.printStackTrace();
22 | 		}
23 | ```
24 | 
25 | If a local copy of the file is available, you can pass its location as a `file:` URL:
26 | 
27 | ```java
28 | 
29 | SortedSet<Cytoband> cytobands = me.getAllCytobands(new URL("file:///path/to/local/copy/cytoBand.txt.gz"));
30 | 
31 | ```
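A hand-written `file:` URL is easy to get wrong: with only two slashes, `file://path/...` makes `path` the host rather than part of the path. The following is a minimal sketch, using only the Java standard library, that derives the URL from a filesystem path instead. The class name `LocalCytobandUrl` and the example path are placeholders and not part of BioJava.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;

final class LocalCytobandUrl {

    // Converts a local path into a file: URL; toUri() takes care of slashes and escaping.
    static URL toFileUrl(String localPath) throws MalformedURLException {
        Path path = Paths.get(localPath).toAbsolutePath();
        return path.toUri().toURL();
    }

    public static void main(String[] args) throws MalformedURLException {
        // hypothetical location of a downloaded copy of cytoBand.txt.gz
        URL url = toFileUrl("/path/to/local/copy/cytoBand.txt.gz");
        System.out.println(url);
    }
}
```

The resulting `URL` can then be passed to `getAllCytobands` in place of the remote one.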
32 | <!--automatically generated footer-->
33 | 
34 | ---
35 | 
36 | Navigation:
37 | [Home](../README.md)
38 | | [Book 4: The Genomics Module](README.md)
39 | | Chapter 5 : karyotype (cytoband)
40 | 
41 | Prev: [Chapter 5 : Genebank](genebank.md)
42 | 
43 | Next: [Chapter 6 : .2bit file format](twobit.md)
44 | 


--------------------------------------------------------------------------------
/genomics/twobit.md:
--------------------------------------------------------------------------------
 1 | Reading a .2bit file
 2 | ====================
 3 | 
 4 | UCSC's .2bit files provide a compact representation of the DNA sequences of a genome. The `TwoBitParser` class provides
 5 | access to the contents of such a file.
 6 | 
 7 | ```java
 8 | File f = new File("/path/to/file.2bit");
 9 | TwoBitParser p = new TwoBitParser(f);
10 | 
11 | String[] names = p.getSequenceNames();
12 | for(int i=0;i<names.length;i++) {
13 |   p.setCurrentSequence(names[i]);
14 |   p.printFastaSequence();
15 |   p.close(); // closes the current sequence so that the next one can be selected
16 | }
17 | 
18 | ```
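The compactness comes from storing each base in two bits, four bases per byte. The sketch below is not BioJava code. It only illustrates the packing described in UCSC's format documentation (T = 00, C = 01, A = 10, G = 11, with the first base in the most significant bits). The class `TwoBitSketch` and its methods are hypothetical names, and the file header, the sequence index, and the N and mask blocks of the real format are left out.

```java
final class TwoBitSketch {

    // Index in this array equals the 2-bit code of the base.
    private static final char[] BASES = {'T', 'C', 'A', 'G'};

    // Maps a base to its 2-bit code.
    static int code(char base) {
        switch (Character.toUpperCase(base)) {
            case 'T': return 0;
            case 'C': return 1;
            case 'A': return 2;
            case 'G': return 3;
            default: throw new IllegalArgumentException("Not a DNA base: " + base);
        }
    }

    // Packs four bases per byte, first base in the two most significant bits.
    static byte[] pack(String dna) {
        byte[] packed = new byte[(dna.length() + 3) / 4];
        for (int i = 0; i < dna.length(); i++) {
            int shift = 6 - 2 * (i % 4);
            packed[i / 4] |= (byte) (code(dna.charAt(i)) << shift);
        }
        return packed;
    }

    // Reverses pack(); the sequence length is needed because the last byte may be padded.
    static String unpack(byte[] packed, int length) {
        StringBuilder dna = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int shift = 6 - 2 * (i % 4);
            dna.append(BASES[(packed[i / 4] >> shift) & 0x3]);
        }
        return dna.toString();
    }

    public static void main(String[] args) {
        byte[] packed = pack("ACGTAC");
        System.out.println(packed.length + " bytes for 6 bases");
        System.out.println(unpack(packed, 6));
    }
}
```

In practice `TwoBitParser` hides this decoding, so the sketch only serves to show why a .2bit file is roughly a quarter the size of the plain-text sequence.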
19 | <!--automatically generated footer-->
20 | 
21 | ---
22 | 
23 | Navigation:
24 | [Home](../README.md)
25 | | [Book 4: The Genomics Module](README.md)
26 | | Chapter 6 : .2bit file format
27 | 
28 | Prev: [Chapter 5 : karyotype (cytoband)](karyotype.md)
29 | 


--------------------------------------------------------------------------------
/installation.md:
--------------------------------------------------------------------------------
 1 | ## Quick Installation
 2 | 
 3 | This section briefly describes how to get access to BioJava.
 4 | 
 5 | BioJava is open source and you can get the code from [GitHub](https://github.com/biojava/biojava). However, it is usually easier to let Maven fetch it for you:
 6 | 
 7 | BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html) guide.
 8 | 
 9 | As of version 4, BioJava is available in Maven Central. This is all you need to add a BioJava dependency to your project's `pom.xml` file:
10 | 
11 | ```xml
12 |         <dependencies>
13 |                 ...
14 | 
15 |                  <!-- This imports the latest version of BioJava genomics module -->
16 |                 <dependency>
17 | 
18 |                         <groupId>org.biojava</groupId>
19 |                         <artifactId>biojava-genome</artifactId>
20 |                         <version>4.2.0</version>
21 |                         <!-- note: the genomics module depends on the BioJava-core module and will import it automatically -->
22 |                 </dependency>
23 | 
24 | 
25 |                 <!-- other biojava jars as needed -->
26 | 
27 | 
28 |                 <!-- This imports the latest version of BioJava structure module -->
29 |                  <dependency>
30 | 
31 |                        <groupId>org.biojava</groupId>
32 |                        <artifactId>biojava-structure</artifactId>
33 |                         <version>4.2.0</version>
34 |                   </dependency>
35 |         </dependencies> 
36 | ```
37 | 
38 | If you run 
39 | 
40 | <pre>
41 |     mvn package
42 | </pre>
43 | 
44 |  on your project, the BioJava dependencies will be automatically downloaded and installed for you.
45 | 
46 | 


--------------------------------------------------------------------------------
/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/logo.png


--------------------------------------------------------------------------------
/modfinder/README.md:
--------------------------------------------------------------------------------
 1 | The ModFinder Module of BioJava
 2 | =====================================================
 3 | 
 4 | A tutorial for the modfinder module of [BioJava](http://www.biojava.org)
 5 | 
 6 | ## About
 7 | <table>
 8 |     <tr>
 9 |         <td>
10 |             <img src='https://cloud.githubusercontent.com/assets/840895/22190971/fe5cd304-e0f4-11e6-9eb5-c1b071312081.png'>
11 |         </td>
12 |         <td>
13 |             The <i>modfinder</i> module of BioJava provides an API for identification of protein pre-, co-, and post-translational modifications from structures.
14 |         </td>
15 |     </tr>
16 | </table>   
17 | 
18 | ## Index
19 | 
20 | This tutorial is split into several chapters.
21 | 
22 | Chapter 1 - Quick [Installation](installation.md)
23 | 
24 | Chapter 2 - [How to get the list of supported protein modifications](supported-protein-modifications.md)
25 | 
26 | Chapter 3 - [How to identify protein modifications in a structure](identify-protein-modifications.md)
27 | 
28 | Chapter 4 - [How to define a new protein modification](add-protein-modification.md)
29 | 
30 | ## License
31 | 
32 | The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
33 | 
34 | ## Please Cite
35 | 
36 | **BioJava-ModFinder: identification of protein modifications in 3D structures from the Protein Data Bank**<br/>
37 | *Jianjiong Gao; Andreas Prlic; Chunxiao Bi; Wolfgang F. Bluhm; Dimitris Dimitropoulos; Dong Xu; Philip E. Bourne; Peter W. Rose* <br/>
38 | [Bioinformatics. 2017 Feb 17.](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx101) <br/>
39 | [![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbtx101-blue.svg?style=flat)](https://doi.org/10.1093/bioinformatics/btx101) [![pubmed](http://img.shields.io/badge/pubmed-28334105-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/28334105)
40 | 
41 | **BioJava 5: A community driven open-source bioinformatics library**<br/>
42 | *Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte* <br/>
43 | [PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791) <br/>
44 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
45 | 
46 | 
47 | 
48 | <!--automatically generated footer-->
49 | 
50 | ---
51 | 
52 | Navigation:
53 | [Home](../README.md)
54 | | Book 6: The ModFinder Module
55 | 
56 | Prev: [Book 5: The Protein-Disorder Module](../protein-disorder/README.md)
57 | 


--------------------------------------------------------------------------------
/modfinder/add-protein-modification.md:
--------------------------------------------------------------------------------
 1 | How to define a new protein modification?
 2 | ===
 3 | 
 4 | The protmod module automatically loads [a list of protein modifications](supported-protein-modifications.md) into the protein modification registry. If the modification you need is not preloaded, you can define it yourself and add it to the registry.
 5 | 
 6 | ## Example: define and register a disulfide bond in Java code
 7 | 
 8 | ```java
 9 | // define the involved components, in this case two cysteines (CYS)
10 | List<Component> components = new ArrayList<>(2);
11 | components.add(Component.of("CYS"));
12 | components.add(Component.of("CYS"));
13 | 
14 | // define the atom linkages between the components, in this case the SG atoms on both CYS groups
15 | ModificationLinkage linkage = new ModificationLinkage(components, 0, "SG", 1, "SG");
16 | 
17 | // define the modification condition, i.e. what components are involved and what atoms are linked between them
18 | ModificationCondition condition = new ModificationConditionImpl(components, Collections.singletonList(linkage));
19 | 
20 | // build a modification
21 | ProteinModification mod =
22 |        new ProteinModificationImpl.Builder("0018_test", 
23 |        ModificationCategory.CROSS_LINK_2,
24 |        ModificationOccurrenceType.NATURAL,
25 |        condition)
26 |        .setDescription("A protein modification that effectively cross-links two L-cysteine residues to form L-cystine.")
27 |        .setFormula("C 6 H 8 N 2 O 2 S 2")
28 |        .setResidId("AA0025")
29 |        .setResidName("L-cystine")
30 |        .setPsimodId("MOD:00034")
31 |        .setPsimodName("L-cystine (cross-link)")
32 |        .setSystematicName("(R,R)-3,3'-disulfane-1,2-diylbis(2-aminopropanoic acid)")
33 |        .addKeyword("disulfide bond")
34 |        .addKeyword("redox-active center")
35 |        .build();
36 | 
37 | //register the modification
38 | ProteinModificationRegistry.register(mod);
39 | ```
40 | 
41 | ## Example: define a disulfide bond in an XML file and register it from Java code
42 | ```xml
43 | <ProteinModifications>
44 | 	<Entry>
45 | 		<Id>0018</Id>
46 | 		<Description>A protein modification that effectively cross-links two L-cysteine residues to form L-cystine.</Description>
47 | 		<SystematicName>(R,R)-3,3'-disulfane-1,2-diylbis(2-aminopropanoic acid)</SystematicName>
48 | 		<CrossReference>
49 | 			<Source>RESID</Source>
50 | 			<Id>AA0025</Id>
51 | 			<Name>L-cystine</Name>
52 | 		</CrossReference>
53 | 		<CrossReference>
54 | 			<Source>PSI-MOD</Source>
55 | 			<Id>MOD:00034</Id>
56 | 			<Name>L-cystine (cross-link)</Name>
57 | 		</CrossReference>
58 | 		<Condition>
59 | 			<Component component="1">
60 | 				<Id source="PDBCC">CYS</Id>
61 | 			</Component>
62 | 			<Component component="2">
63 | 				<Id source="PDBCC">CYS</Id>
64 | 			</Component>
65 | 			<Bond>
66 | 				<Atom component="1">SG</Atom>
67 | 				<Atom component="2">SG</Atom>
68 | 			</Bond>
69 | 		</Condition>
70 | 		<Occurrence>natural</Occurrence>
71 | 		<Category>crosslink2</Category>
72 | 		<Keyword>redox-active center</Keyword>
73 | 		<Keyword>disulfide bond</Keyword>
74 | 	</Entry>
75 | </ProteinModifications>
76 | ```
77 | 
78 | ```java
79 | FileInputStream fis = new FileInputStream("path/to/file");
80 | ProteinModificationXmlReader.registerProteinModificationFromXml(fis);
81 | ```
82 | 
83 | 
84 | Navigation:
85 | [Home](../README.md)
86 | | [Book 6: The ModFinder Modules](README.md)
87 | | Chapter 4 - How to define a new protein modification
88 | 
89 | Prev: [Chapter 3 : How to identify protein modifications in a structure](identify-protein-modifications.md)
90 | 
91 | 


--------------------------------------------------------------------------------
/modfinder/identify-protein-modifications.md:
--------------------------------------------------------------------------------
 1 | How to identify protein modifications in a structure?
 2 | ===
 3 | 
 4 | ## Example: Identify and print all preloaded modifications from a structure
 5 | 
 6 | ```java
 7 | Set<ModifiedCompound> identifyAllModfications(Structure struc) {
 8 |    ProteinModificationIdentifier parser = new ProteinModificationIdentifier();
 9 |    parser.identify(struc);
10 |    Set<ModifiedCompound> mcs = parser.getIdentifiedModifiedCompound();
11 |    return mcs;
12 | }
13 | ```
14 | 
15 | ## Example: Identify phosphorylation sites in a structure
16 | 
17 | ```java
18 | List<ResidueNumber> identifyPhosphosites(Structure struc) {
19 |     List<ResidueNumber> phosphosites = new ArrayList<>();
20 |     ProteinModificationIdentifier parser = new ProteinModificationIdentifier();
21 |     parser.identify(struc, ProteinModificationRegistry.getByKeyword("phosphoprotein"));
22 |     Set<ModifiedCompound> mcs = parser.getIdentifiedModifiedCompound();
23 |     for (ModifiedCompound mc : mcs) {
24 |         Set<StructureGroup> groups = mc.getGroups(true);
25 |         for (StructureGroup group : groups) {
26 |             phosphosites.add(group.getPDBResidueNumber());
27 |         }
28 |     }
29 |     return phosphosites;
30 | }
31 | ```
32 | 
33 | ## Demo code to run the above methods
34 | 
35 | ```java
36 | import org.biojava.nbio.structure.ResidueNumber;
37 | import org.biojava.nbio.structure.Structure;
38 | import org.biojava.nbio.structure.io.PDBFileReader;
39 | import org.biojava.nbio.protmod.structure.ProteinModificationIdentifier;
40 | 
41 | public static void main(String[] args) {
42 |     try {
43 |         PDBFileReader reader = new PDBFileReader();
44 |         reader.setAutoFetch(true);
45 | 
46 |         // identify all modifications from PDB:1CAD and print them
47 |         String pdbId = "1CAD";
48 |         Structure struc = reader.getStructureById(pdbId);
49 |         Set<ModifiedCompound> mcs = identifyAllModfications(struc);
50 |         for (ModifiedCompound mc : mcs) {
51 |             System.out.println(mc.toString());
52 |         }
53 | 
54 |         // identify all phosphosites from PDB:3MVJ and print them
55 |         pdbId = "3MVJ";
56 |         struc = reader.getStructureById(pdbId);
57 |         List<ResidueNumber> psites = identifyPhosphosites(struc);
58 |         for (ResidueNumber psite : psites) {
59 |             System.out.println(psite.toString());
60 |         }
61 |     } catch(Exception e) {
62 |         e.printStackTrace();  
63 |     }
64 | }
65 | ```
66 | 
67 | 
68 | Navigation:
69 | [Home](../README.md)
70 | | [Book 6: The ModFinder Modules](README.md)
71 | | Chapter 3 - How to identify protein modifications in a structure
72 | 
73 | Prev: [Chapter 2 : How to get a list of supported protein modifications](supported-protein-modifications.md)
74 | 
75 | Next: [Chapter 4 : How to define a new protein modification](add-protein-modification.md)
76 | 


--------------------------------------------------------------------------------
/modfinder/installation.md:
--------------------------------------------------------------------------------
 1 | ## Quick Installation
 2 | 
 3 | This section briefly describes how to get access to BioJava.
 4 | 
 5 | BioJava is open source and you can get the code from [GitHub](https://github.com/biojava/biojava). However, it is usually easier to let Maven fetch it for you:
 6 | 
 7 | BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html) guide.
 8 | 
 9 | As of version 4, BioJava is available in Maven Central. This is all you need to add BioJava dependencies to your project in the `pom.xml` file:
10 | 
11 | ```xml
12 |         <dependencies>
13 |                 ...
14 |                 <dependency>
15 |                         <!-- This imports release 4.2.0 of the protein structure module of BioJava.
16 |                         -->                        
17 |                         <groupId>org.biojava</groupId>
18 |                         <artifactId>biojava-structure</artifactId>
19 |                         <version>4.2.0</version>
20 |                 </dependency>
21 |                 <dependency>
22 |                         <!-- This imports release 4.2.0 of the protein modfinder module of BioJava.
23 |                         -->                        
24 |                         <groupId>org.biojava</groupId>
25 |                         <artifactId>biojava-modfinder</artifactId>
26 |                         <version>4.2.0</version>
27 |                 </dependency>
28 |                 <!-- other biojava jars as needed -->
29 |         </dependencies> 
30 | ```
31 | 
32 | If you run 
33 | 
34 | <pre>
35 |     mvn package
36 | </pre>
37 | 
38 |  on your project, the BioJava dependencies will be automatically downloaded and installed for you.
39 | 
40 | 
41 | <!--automatically generated footer-->
42 | 
43 | ---
44 | 
45 | Navigation:
46 | [Home](../README.md)
47 | | [Book 6: The ModFinder Modules](README.md)
48 | | Chapter 1 : Installation
49 | 
50 | Next: [Chapter 2 : How to get the list of supported protein modifications](supported-protein-modifications.md)
51 | 


--------------------------------------------------------------------------------
/modfinder/supported-protein-modifications.md:
--------------------------------------------------------------------------------
 1 | How to get a list of supported protein modifications?
 2 | ===
 3 | 
 4 | The protmod module contains [an XML file](https://github.com/biojava/biojava/blob/master/biojava-modfinder/src/main/resources/org/biojava/nbio/protmod/ptm_list.xml), defining a list of protein modifications retrieved from the [Protein Data Bank Chemical Component Dictionary](http://www.wwpdb.org/ccd.html), [RESID](http://pir.georgetown.edu/resid/), and [PSI-MOD](http://www.psidev.info/MOD). It contains many common modifications such as glycosylation, phosphorylation, acetylation, and methylation. Crosslinks, such as disulfide bonds and isopeptide bonds, are also included.
 5 | 
 6 | The protmod module maintains a registry of supported protein modifications. The modifications listed in the XML file are loaded automatically. You can [define and register a new protein modification](add-protein-modification.md) if it has not been defined in the XML file. From the protein modification registry, a user can retrieve:
 7 | - all protein modifications,
 8 | - a protein modification by ID,
 9 | - a set of protein modifications by RESID ID,
10 | - a set of protein modifications by PSI-MOD ID,
11 | - a set of protein modifications by PDBCC ID,
12 | - a set of protein modifications by category (attachment, modified residue, crosslink1, crosslink2, …, crosslink7),
13 | - a set of protein modifications by occurrence type (natural or hypothetical),
14 | - a set of protein modifications by a keyword (glycoprotein, phosphoprotein, sulfoprotein, …),
15 | - a set of protein modifications by involved components.
16 | 
17 | ## Examples
18 | 
19 | ```java 
20 | // a protein modification by ID 
21 | ProteinModification mod = ProteinModificationRegistry.getById("0001");
22 | 
23 | Set<ProteinModification> mods;
24 | 
25 | // all protein modifications 
26 | mods = ProteinModificationRegistry.allModifications();
27 | 
28 | // a set of protein modifications by RESID ID 
29 | mods = ProteinModificationRegistry.getByResidId("AA0151");
30 | 
31 | // a set of protein modifications by PSI-MOD ID 
32 | mods = ProteinModificationRegistry.getByPsimodId("MOD:00305");
33 | 
34 | // a set of protein modifications by PDBCC ID 
35 | mods = ProteinModificationRegistry.getByPdbccId("SEP");
36 | 
37 | // a set of protein modifications by category 
38 | mods = ProteinModificationRegistry.getByCategory(ModificationCategory.ATTACHMENT);
39 | 
40 | // a set of protein modifications by occurrence type 
41 | mods = ProteinModificationRegistry.getByOccurrenceType(ModificationOccurrenceType.NATURAL);
42 | 
43 | // a set of protein modifications by a keyword 
44 | mods = ProteinModificationRegistry.getByKeyword("phosphoprotein");
45 | 
46 | // a set of protein modifications by involved components. 
47 | mods = ProteinModificationRegistry.getByComponent(Component.of("FAD"));
48 | 
49 | ```
50 | 
51 | Navigation:
52 | [Home](../README.md)
53 | | [Book 6: The ModFinder Modules](README.md)
54 | | Chapter 2 - How to get a list of supported protein modifications
55 | 
56 | Prev: [Chapter 1 : Installation](installation.md)
57 | 
58 | Next: [Chapter 3 : How to identify protein modifications in a structure](identify-protein-modifications.md)
59 | 


--------------------------------------------------------------------------------
/protein-disorder/README.md:
--------------------------------------------------------------------------------
  1 | The Protein-Disorder Module of BioJava
  2 | =====================================================
  3 | 
  4 | A tutorial for the protein-disorder module of [BioJava](http://www.biojava.org)
  5 | 
  6 | ## About
  7 | <table>
  8 |     <tr>
  9 |         <td>
 10 | 
 11 |         </td>
 12 |         <td>
 13 |             The <i>protein-disorder module</i> of BioJava provides an API that allows you to
 14 |             <ul>
 15 |                 <li>predict protein-disorder using the JRONN algorithm</li>
 16 |             </ul>
 17 | 
 18 | 
 19 |         </td>
 20 |     </tr>
 21 | </table>   
 22 | 
 23 | ## How can I predict disordered regions on a protein sequence?
 24 | 
 25 | 
 26 | BioJava provides the *biojava-protein-disorder* module for predicting
 27 | disordered regions in a protein sequence. For now, the module
 28 | contains a single method for predicting disordered regions. This
 29 | method is based on a Java implementation of the
 30 | [RONN](http://www.strubi.ox.ac.uk/RONN) predictor.
 31 | 
 32 | This code was originally developed for use with
 33 | [JABAWS](http://www.compbio.dundee.ac.uk/jabaws). We call this code
 34 | *JRONN*. *JRONN* is based on the C implementation of the RONN algorithm
 35 | and uses the same model data, so it gives the same predictions. JRONN
 36 | is based on RONN version 3.1, which was still current at the time of
 37 | writing (August 2011). The main motivation behind JRONN was to provide
 38 | an implementation of RONN better suited to automated analysis
 39 | pipelines and web services. Robert Esnouf has kindly allowed us to
 40 | explore the RONN code and share the results with the community.
 41 | 
 42 | The original version of RONN is described in [Yang,Z.R., Thomson,R.,
 43 | McNeil,P. and Esnouf,R.M. (2005) RONN: the bio-basis function neural
 44 | network technique applied to the detection of natively disordered
 45 | regions in proteins. Bioinformatics 21:
 46 | 3369-3376](http://bioinformatics.oxfordjournals.org/content/21/16/3369.full).
 47 | 
 48 | Examples of use are provided below. For more information, please refer to
 49 | the JronnExample test cases.
 50 | 
 51 | Finally, instead of API calls you can use a [command line
 52 | utility](http://biojava.org/wikis/BioJava:CookBook3:ProteinDisorderCLI/ "wikilink"), which is
 53 | likely to give you better performance, as it uses multiple threads to
 54 | perform calculations.
 55 | 
 56 | Example 1: Calculate the probability of disorder for every residue in the sequence
 57 | ----------------------------------------------------------------------------------
 58 | 
 59 | ```java
 60 | FastaSequence fsequence = new FastaSequence("name",
 61 |   "LLRGRHLMNGTMIMRPWNFLNDHHFPKFFPHLIEQQAIWLADWWRKKHC" +
 62 |   "RPLPTRAPTMDQWDHFALIQKHWTANLWFLTFPFNDKWGWIWFLKDWTPGSADQAQRACTWFFCHGHDTN");
 63 | 
 64 | float[] rawProbabilityScores = Jronn.getDisorderScores(fsequence);
 65 | ```
 66 | 
 67 | Example 2: Calculate the probability of disorder for every residue in the sequence for all proteins from the FASTA input file
 68 | -----------------------------------------------------------------------------------------------------------------------------
 69 | 
 70 | ```java
 71 | final List<FastaSequence> sequences = SequenceUtil.readFasta(new FileInputStream("src/test/resources/fasta.in"));
 72 | Map<FastaSequence, float[]> rawProbabilityScores = Jronn.getDisorderScores(sequences); 
 73 | ```
 74 | 
 75 | Example 3: Get the disordered regions of the protein for a single protein sequence
 76 | ----------------------------------------------------------------------------------
 77 | 
 78 | ```java
 79 | FastaSequence fsequence = new FastaSequence("Prot1", "LLRGRHLMNGTMIMRPWNFLNDHHFPKFFPHLIEQQAIWLADWWRKKHC" +
 80 |                "RPLPTRAPTMDQWDHFALIQKHWTANLWFLTFPFNDKWGWIWFLKDWTPGSADQAQRACTWFFCHGHDTN" +
 81 |                "CQIIFEGRNAPERADPMWTGGLNKHIIARGHFFQSNKFHFLERKFCEMAEIERPNFTCRTLDCQKFPWDDP");
 82 | 
 83 | Range[] ranges = Jronn.getDisorder(fsequence);
 84 | ```
 85 | 
 86 | Example 4: Calculate the disordered regions for the proteins from FASTA file
 87 | ----------------------------------------------------------------------------
 88 | 
 89 | ```java
 90 | final List<FastaSequence> sequences = SequenceUtil.readFasta(new FileInputStream("src/test/resources/fasta.in"));
 91 | Map<FastaSequence, Range[]> ranges = Jronn.getDisorder(sequences);
 92 | 
 93 | ```
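Examples 1 and 2 return one raw score per residue, while Examples 3 and 4 return ranges. One plausible way to relate the two is to report every run of consecutive residues whose score reaches a cut-off. The sketch below does this with plain arrays. It is not JRONN's implementation, and the class name `DisorderRangesSketch`, the method `regionsAbove`, and the cut-off of 0.5 are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class DisorderRangesSketch {

    // Returns inclusive, zero-based {start, end} index pairs for runs with score >= threshold.
    static List<int[]> regionsAbove(float[] scores, float threshold) {
        List<int[]> regions = new ArrayList<>();
        int start = -1; // -1 means we are not inside a run
        for (int i = 0; i < scores.length; i++) {
            boolean disordered = scores[i] >= threshold;
            if (disordered && start < 0) {
                start = i;                              // a run begins
            } else if (!disordered && start >= 0) {
                regions.add(new int[] {start, i - 1});  // a run ends
                start = -1;
            }
        }
        if (start >= 0) {
            regions.add(new int[] {start, scores.length - 1}); // run reaches the last residue
        }
        return regions;
    }

    public static void main(String[] args) {
        // made-up scores, not the output of a real prediction
        float[] scores = {0.1f, 0.6f, 0.7f, 0.2f, 0.9f};
        for (int[] region : regionsAbove(scores, 0.5f)) {
            System.out.println(region[0] + "-" + region[1]);
        }
    }
}
```

For real work, prefer `Jronn.getDisorder`, which applies the library's own rules for turning scores into ranges.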
 94 | 
 95 | ## License
 96 | 
 97 | The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
 98 | 
 99 | ## Please Cite
100 | 
101 | **BioJava 5: A community driven open-source bioinformatics library**<br/>
102 | *Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte* <br/>
103 | [PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791) <br/>
104 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
105 | 
106 | 
107 | 
108 | <!--automatically generated footer-->
109 | 
110 | ---
111 | 
112 | Navigation:
113 | [Home](../README.md)
114 | | Book 5: The Protein-Disorder Module
115 | 
116 | Prev: [Book 4: The Genomics Module](../genomics/README.md)
117 | | Next: [Book 6: The ModFinder Module](../modfinder/README.md)
118 | 


--------------------------------------------------------------------------------
/structure/README.md:
--------------------------------------------------------------------------------
 1 | The Structure Modules of BioJava
 2 | =====================================================
 3 | 
 4 | A tutorial for the structure modules of [BioJava](http://www.biojava.org)
 5 | 
 6 | ## About
 7 | <table>
 8 |     <tr>
 9 |         <td>
10 |             <img src="img/4hhb_jmol.png"/>
11 |         </td>
12 |         <td>
13 |             The <i>protein structure modules</i> of BioJava provide an API that allows you to 
14 |             <ul>
15 |                 <li>Maintain local installations of PDB</li>
16 |                 <li>Load structures and manipulate them</li>
17 |                 <li>Perform standard analysis such as sequence and structure alignments</li>
18 |                 <li>Visualize structures</li>
19 |             </ul>
20 |             This tutorial provides an overview of the most important functionalities.
21 |         </td>
22 |     </tr>
23 | </table>   
24 | 
25 | ## Index
26 | 
27 | This tutorial is split into several chapters.
28 | 
29 | 
30 | Chapter 1 - Quick [Installation](installation.md)
31 | 
32 | Chapter 2 - [First Steps](firststeps.md)
33 | 
34 | Chapter 3 - The [Structure Data Model](structure-data-model.md), for the representation of macromolecular structures
35 | 
36 | Chapter 4 - [Local Installations](caching.md) of PDB
37 | 
38 | Chapter 5 - The [Chemical Component Dictionary](chemcomp.md)
39 | 
40 | Chapter 6 - How to [Work with mmCIF/PDBx Files](mmcif.md)
41 | 
42 | Chapter 7 - [SEQRES and ATOM Records](seqres.md), mapping to UniProt (SIFTS)
43 | 
44 | Chapter 8 - [Structure Alignments](alignment.md)
45 | 
46 | Chapter 9 - [Biological Assemblies](bioassembly.md)
47 | 
48 | Chapter 10 - [External Databases](externaldb.md) like SCOP &amp; CATH
49 | 
50 | Chapter 11 - [Accessible Surface Areas](asa.md)
51 | 
52 | Chapter 12 - [Contacts Within a Chain and between Chains](contact-map.md)
53 | 
54 | Chapter 13 - Finding all Interfaces in Crystal: [Crystal Contacts](crystal-contacts.md)
55 | 
56 | Chapter 14 - [Protein Symmetry](symmetry.md)
57 | 
58 | Chapter 15 - [Protein Secondary Structure](secstruc.md)
59 | 
60 | Chapter 16 - Bonds
61 | 
62 | Chapter 17 - [Special Cases](special.md)
63 | 
64 | Chapter 18 - [Lists](lists.md) of PDB IDs and PDB [Status Information](lists.md)
65 | 
66 | 
67 | ## License
68 | 
69 | The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md).
70 | 
71 | ## Please Cite
72 | 
73 | **BioJava 5: A community driven open-source bioinformatics library**<br/>
74 | *Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte* <br/>
75 | [PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791) <br/>
76 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
77 | 
78 | 
79 | 
80 | <!--automatically generated footer-->
81 | 
82 | ---
83 | 
84 | Navigation:
85 | [Home](../README.md)
86 | | Book 3: The Structure Modules
87 | 
88 | Prev: [Book 2: The Alignment Module](../alignment/README.md)
89 | 
90 | Next: [Book 4: The Genomics Module](../genomics/README.md)
91 | 


--------------------------------------------------------------------------------
/structure/alignment-data-model.md:
--------------------------------------------------------------------------------
  1 | Structure Alignment Data Model
  2 | ===
  3 | 
  4 | ## AFPChain Data Model
  5 | 
  6 | The `AFPChain` data structure was designed to store pairwise structural
  7 | alignments. The class functions as a bean, and contains many variables 
  8 | used internally by the alignment algorithms implemented in BioJava.
  9 | 
 10 | Some of the important stored variables are:
 11 | * Algorithm Name
 12 | * Optimal Alignment: described later.
 13 | * Optimal RMSD: final and total RMSD value of the alignment.
 14 | * TM-score
 15 | * BlockRotationMatrix: rotation component of the superposition transformation.
 16 | * BlockShiftVector: translation component of the superposition transformation.
 17 | 
 18 | BioJava class: [org.biojava.nbio.structure.align.model.AFPChain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/model/AFPChain.html)
 19 | 
 20 | ### The Optimal Alignment
 21 | 
 22 | The residue equivalencies of the alignment (EQRs) are described in the optimal 
 23 | alignment variable, a three-dimensional array of integers, where the indices stand for:
 24 | 
 25 | ```java
 26 |   int[][][] optAln = afpChain.getOptAln();
 27 |   int residue = optAln[block][chain][eqr];
 28 | ```
 29 | 
 30 | * **block**: the blocks divide the alignment into different parts. The 
 31 | division can be due to non-topological rearrangements (e.g. circular 
 32 | permutations) or due to flexible parts (e.g. domain switch). There can 
 33 | be any number of blocks in a structural alignment, defined by the structure 
 34 | alignment algorithm.
 35 | * **chain**: in a pairwise alignment there are only two chains, or structures.
 36 | * **eqr**: EQR stands for equivalent residue position, i.e. the alignment 
 37 | position. There are as many positions (EQRs) in a block as the length of 
 38 | the alignment block, and their number is equal for any of the two chains in 
 39 | the same block.
 40 | 
 41 | Each entry (a combination of the three indices described above) stores an 
 42 | integer that corresponds to the residue index in the specified chain, i.e.
 43 | the index in the Atom array of the chain. Within a block, the stored
 44 | integers (residues) are always in increasing order.
 45 | 
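As a minimal sketch of how the three indices are used, the following plain-Java example walks a hand-written `optAln` array and lists the aligned residue pairs of each block. The array contents and the class and method names (`OptAlnWalk`, `alignedPairs`) are hypothetical; in real code the array would come from `afpChain.getOptAln()`.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: the optAln values below are made up for illustration.
final class OptAlnWalk {

    // Returns one "block:res1-res2" string per aligned position (EQR).
    static List<String> alignedPairs(int[][][] optAln) {
        List<String> pairs = new ArrayList<>();
        for (int block = 0; block < optAln.length; block++) {
            int length = optAln[block][0].length; // same length for both chains
            for (int eqr = 0; eqr < length; eqr++) {
                int res1 = optAln[block][0][eqr]; // index in the Atom array of chain 1
                int res2 = optAln[block][1][eqr]; // index in the Atom array of chain 2
                pairs.add(block + ":" + res1 + "-" + res2);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Two blocks, as a circular permutation would produce.
        int[][][] optAln = {
            { {0, 1, 2}, {50, 51, 52} },
            { {60, 61},  {3, 4} }
        };
        for (String pair : alignedPairs(optAln)) {
            System.out.println(pair);
        }
    }
}
```

With a real `AFPChain`, the per-block alignment lengths are also available from `afpChain.getOptLen()`, which is the safer bound to loop over.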
 46 | ### Examples
 47 | 
 48 | Some examples of how to get the basic properties of an `AFPChain`:
 49 | 
 50 | ```java
 51 |   String name = afpChain.getAlgorithmName();               //Name of the algorithm that generated the alignment
 52 |   int blockNum = afpChain.getBlockNum();                   //Number of blocks
 53 |   double tmScore = afpChain.getTMScore();                  //TM-score
 54 |   double rmsd = afpChain.getTotalRmsdOpt();                //Optimal RMSD
 55 |   Matrix rotation = afpChain.getBlockRotationMatrix()[0];  //Rotation matrix of the first block
 56 |   Atom shift = afpChain.getBlockShiftVector()[0];          //Translation vector of the first block
 57 | ```
 58 | 
 59 | ### Overview
 60 | 
 61 | As an overview, the `AFPChain` data model:
 62 | 
 63 | * Only supports **pairwise alignments**, i.e. two chains or structures aligned.
 64 | * Can support **flexible alignments** and **non-topological alignments**. 
 65 | However, their combination (a flexible alignment with topological rearrangements) 
 66 | cannot be represented, because the blocks mean either one or the other. 
 67 | * Cannot practically support **non-sequential alignments**: they would require a new block 
 68 | for each EQR, because sequentiality of the residues is assumed inside each block.
 69 | 
 70 | ## MultipleAlignment Data Model
 71 | 
 72 | Since BioJava 4.1.0, a new data model is available to store structure alignments.
 73 | The `MultipleAlignment` data structure is a general model that supports any of the 
 74 | following properties, and any combination:
 75 | 
 76 | * **Multiple structures**: the model is no longer restricted to pairwise alignments.
 77 | * **Non-topological alignments**: such as circular permutations or domain rearrangements.
 78 | * **Flexible alignments**: parts of the alignment with different superposition 
 79 | transformation.
 80 | 
 81 | In addition, the data structure is not limited in the number and types of scores
 82 | it can store, because the scores are stored as key:value pairs, as
 83 | described later.
 84 | 
 85 | BioJava class: [org.biojava.nbio.structure.align.multiple.MultipleAlignment](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/MultipleAlignment.html)
 86 | 
 87 | ### Object Hierarchy
 88 | 
 89 | The biggest difference with `AFPChain` is that the `MultipleAlignment` data 
 90 | structure is object oriented.
 91 | The hierarchy of sub-objects is represented below:
 92 | 
 93 | <pre>
 94 | MultipleAlignmentEnsemble
 95 |    |
 96 |    MultipleAlignment(s)
 97 |         |
 98 |         BlockSet(s)
 99 |             |
100 |             Block(s)
101 | </pre>
102 | 
103 | * **MultipleAlignmentEnsemble**: the ensemble is the top level of the hierarchy.
104 | As a top level, it stores information regarding creation properties (algorithm,
105 | version, creation time, etc.), the structures involved in the alignment (Atoms,
106 | structure identifiers, etc.) and cached variables (atomic distance matrices). 
107 | It contains a collection of `MultipleAlignment` that share the same properties 
108 | stored in the ensemble. This construction allows the storage of alternative 
109 | alignments inside the same data structure.
110 | 
111 | * **MultipleAlignment**: the `MultipleAlignment` stores the core information of a 
112 | multiple structure alignment. It is designed to be the return type of the multiple
113 | structure alignment algorithms. The object contains a collection of `BlockSet` and 
114 | it is linked to its parent `MultipleAlignmentEnsemble`.
115 | 
116 | * **BlockSet**: the `BlockSet` stores a flexible part of a multiple structure 
117 | alignment. A flexible part needs the residue equivalencies involved, contained in
118 | a collection of `Block`, and a transformation matrix for every structure that 
119 | describes the 3D superposition of all structures. It is linked to its parent
120 | `MultipleAlignment`.
121 | 
122 | * **Block**: the `Block` stores the aligned positions (equivalent residues) of a 
123 | `BlockSet` that are in sequentially increasing order. Each `Block` represents a 
124 | sequential part of a non-topological alignment, if more than one `Block` is present.
125 | It is linked to its parent `BlockSet`.
126 | 
127 | ### The Optimal Alignment
128 | 
129 | In the `MultipleAlignment` data structure the aligned residues are stored in a
130 | double List for every `Block`. The indices of the double List are the following:
131 | 
132 | ```java
133 |   List<List<Integer>> optAln = block.getAlnRes();
134 |   Integer residue = optAln.get(chain).get(eqr);
135 | ```
136 | 
137 | The indices have the same meaning as in the optimal alignment of the `AFPChain`.
138 | As a reminder:
139 | 
140 | * **chain**: chain or structure index.
141 | * **eqr**: EQR stands for equivalent residue position, i.e. the alignment 
142 | position. There are as many positions (EQRs) in a block as the length of 
143 | the alignment block, and their number is equal for any of the chains in 
144 | the same block.
145 | 
146 | As in `AFPChain`, each entry (combination of the two indices described above) 
147 | is an Integer that corresponds to the residue index in the specified chain, i.e.
148 | the index in the Atom array of the chain. Note that a `MultipleAlignment`
149 | can contain gaps, which are represented as `null` entries in the List,
150 | so check for `null` before using an entry.
151 | 
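Because gaps are stored as `null`, code that reads the double List has to test each entry before using it. The following is a minimal plain-Java sketch that counts the alignment positions without gaps; the alignment values and the names `GapAwareColumns` and `countUngappedColumns` are hypothetical, and in real code the List would come from `block.getAlnRes()`.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: the alignment below is made up for illustration.
final class GapAwareColumns {

    // Counts the alignment positions (EQRs) where no chain has a gap.
    static int countUngappedColumns(List<List<Integer>> alnRes) {
        if (alnRes.isEmpty()) {
            return 0;
        }
        int length = alnRes.get(0).size(); // equal for every chain in a block
        int ungapped = 0;
        for (int eqr = 0; eqr < length; eqr++) {
            boolean hasGap = false;
            for (List<Integer> chain : alnRes) {
                if (chain.get(eqr) == null) { // null marks a gap
                    hasGap = true;
                    break;
                }
            }
            if (!hasGap) {
                ungapped++;
            }
        }
        return ungapped;
    }

    public static void main(String[] args) {
        List<List<Integer>> alnRes = Arrays.asList(
            Arrays.asList(10, 11, 12, 13),
            Arrays.asList(20, null, 22, 23),
            Arrays.asList(30, 31, 32, null));
        System.out.println(countUngappedColumns(alnRes));
    }
}
```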
152 | ### Alignment Scores
153 | 
154 | All the objects in the hierarchy implement the `ScoresCache` interface.
155 | This interface allows the storage of any number of scores as key:value pairs.
156 | The key is a `String` that describes the score and is used to retrieve it later,
157 | and the value is a double with the calculated score. The interface has only 
158 | two methods: `putScore` and `getScore`.
159 | 
160 | The following lines show how to store and retrieve scores
161 | on a `MultipleAlignment`:
162 | 
163 | ```java
164 |   //Put a score into the alignment and get it back
165 |   alignment.putScore("myRMSD", 1.234);
166 |   double myRMSD = alignment.getScore("myRMSD");
167 |   
168 |   //The same can be done for BlockSets
169 |   BlockSet bs = alignment.getBlockSets().get(0);
170 |   bs.putScore("bsRMSD", 1.234);
171 |   double bsRMSD = bs.getScore("bsRMSD");
172 | ```
173 | 
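To illustrate the key:value idea on its own, here is a minimal sketch of a score store backed by a `HashMap`. The class `SimpleScores` is a hypothetical stand-in written for this example and is not the BioJava implementation; only the method names `putScore` and `getScore` are taken from the text above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a score cache; not the BioJava implementation.
final class SimpleScores {

    private final Map<String, Double> scores = new HashMap<>();

    // Stores (or overwrites) the score under the given name.
    void putScore(String name, Double value) {
        scores.put(name, value);
    }

    // Returns the stored score, or null if no score has that name.
    Double getScore(String name) {
        return scores.get(name);
    }

    public static void main(String[] args) {
        SimpleScores cache = new SimpleScores();
        cache.putScore("myRMSD", 1.234);
        System.out.println(cache.getScore("myRMSD"));
        System.out.println(cache.getScore("missing"));
    }
}
```

In this sketch a missing score is returned as `null`, so assigning the result directly to a primitive `double` would fail for an unknown key; testing for `null` first avoids that.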
174 | ### Manipulating Multiple Alignments
175 | 
176 | Some classes are designed to contain utility methods for manipulating a `MultipleAlignment` object.
177 | The most important ones are listed and briefly described below:
178 | 
179 | * [MultipleAlignmentScorer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentScorer.html): contains the names of commonly used scores and methods to calculate them.
180 | 
181 | * [MultipleAlignmentTools](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentTools.html): contains helper methods, for example to calculate the sequence alignment, transform the atom arrays of the structures, or calculate the distances between aligned residues of all structures.
182 | 
183 | * [MultipleAlignmentWriter](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentWriter.html): contains methods to generate different types of String outputs of the alignment, e.g. FASTA, XML, FatCat.
184 | 
185 | * [MultipleSuperimposer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleSuperimposer.html): interface for implementations that calculate the structure superpositions of the alignment. Some examples of implementations are the ReferenceSuperimposer (superimposes all the structures to a reference) and the CoreSuperimposer (only uses EQRs present in all structures, without gaps, to superimpose them).
186 | 
187 | * [MultipleAlignmentXMLParser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/xml/MultipleAlignmentXMLParser.html): contains a method to create a `MultipleAlignment` object from an XML file representation.
188 | 
189 | ### Overview
190 | 
191 | As an overview, the `MultipleAlignment` data model:
192 | 
193 | * Supports any number of aligned structures, **multiple structures**.
194 | * Can support **flexible alignments** and **non-topological alignments**,
195 | and any of their combinations (e.g. a flexible alignment with topological 
196 | rearrangements).
197 | * Cannot practically support **non-sequential alignments**: they would require a new 
198 | `Block` for each EQR, because sequentiality of the residues is a requirement
199 | for each `Block`.
200 | * Can store **any score** at any of the four levels of the object hierarchy, making it
201 | easy to adapt to new requirements and algorithms.
202 | 
203 | For more examples and information about the `MultipleAlignment` data structure, 
204 | see the demo package of the biojava-structure module, or browse the interface 
205 | files, which contain the Javadoc explanations.
206 | 
207 | ## Conversion between Data Models
208 | 
209 | The conversion from an `AFPChain` to a `MultipleAlignment` is possible through the
210 | ensemble constructor. An example of how to do it programmatically is below:
211 | 
212 | ```java
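213 |   AFPChain afpChain;         //the pairwise alignment to convert
214 |   Atom[] chain1;             //atoms of the first structure
215 |   Atom[] chain2;             //atoms of the second structure
216 |   boolean flexible = false;  //true if the alignment is flexible
217 |   MultipleAlignmentEnsemble ensemble = new MultipleAlignmentEnsembleImpl(afpChain, chain1, chain2, flexible);
218 |   MultipleAlignment converted = ensemble.getMultipleAlignment(0);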
219 | ```
220 | 
221 | There is no method to convert from a `MultipleAlignment` to an `AFPChain`, because
222 | the former supports any number of structures, while the latter 
223 | supports only pairwise alignments. However, the conversion can be done with a few
224 | lines of code if needed (instantiate a new `AFPChain` and copy, one by one, the 
225 | properties that can be represented from the `MultipleAlignment`).
226 | 
227 | ===
228 | 
229 | Go back to [Chapter 8 : Structure Alignments](alignment.md).
230 | 


--------------------------------------------------------------------------------
/structure/alignment.md:
--------------------------------------------------------------------------------
  1 | Structure Alignments
  2 | ===========================
  3 | 
  4 | ## What is a Structure Alignment?
  5 | 
  6 | A **structural alignment** attempts to establish equivalences between two or 
  7 | more polymer structures based on their shape and three-dimensional conformation. 
  8 | In contrast to simple structural superposition (see below), where at least some 
  9 | equivalent residues of the two structures are known, structural alignment requires 
 10 | no a priori knowledge of equivalent positions.
 11 | 
 12 | A **structural alignment** is a valuable tool for the comparison of proteins with 
 13 | low sequence similarity, where evolutionary relationships between proteins cannot 
 14 | be easily detected by standard sequence alignment techniques. Therefore, a 
 15 | **structural alignment** can be used to imply evolutionary relationships between 
 16 | proteins that share very little common sequence. However, caution should be exercised 
 17 | when using the results as evidence for shared evolutionary ancestry, because of the 
 18 | possible confounding effects of convergent evolution by which multiple unrelated amino 
 19 | acid sequences converge on a common tertiary structure.
 20 | 
 21 | A **structural alignment** of other biological polymers can also be made in BioJava.
 22 | For example, nucleic acids can be structurally aligned to find common structural motifs, 
 23 | independent of sequence similarity. This is especially important for RNAs, because their
 24 | 3D structural arrangement is important for their function.
 25 | 
 26 | For more info see the Wikipedia article on [structure alignment](http://en.wikipedia.org/wiki/Structural_alignment).
 27 | 
 28 | ## Alignment Algorithms Supported by BioJava
 29 | 
 30 | BioJava comes with a number of algorithms for aligning structures. The following
 31 | five options are displayed by default in the graphical user interface (GUI),
 32 | although others can be accessed programmatically using the methods in 
 33 | [StructureAlignmentFactory](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/StructureAlignmentFactory.html).
 34 | 
 35 | 1. Combinatorial Extension (CE)
 36 | 2. Combinatorial Extension with Circular Permutation (CE-CP)
 37 | 3. FATCAT - rigid
 38 | 4. FATCAT - flexible
 39 | 5. Smith-Waterman superposition
 40 | 
 41 | **CE** and **FATCAT** both use structural similarity to align the structures, while
 42 | **Smith-Waterman** performs a local sequence alignment and then displays the result
 43 | in 3D. See below for descriptions of the algorithms.
 44 | 
 45 | Since BioJava version 4.1.0, multiple structures can be compared at the same time in 
 46 | a **multiple structure alignment**, which can later be visualized in Jmol. 
 47 | The algorithm is described in detail below. As an overview, it uses any pairwise alignment 
 48 | algorithm and a **reference** structure to perform an alignment of all the structures.
 49 | Then, it runs a **Monte Carlo** optimization to determine the residue equivalencies among
 50 | all the structures, identifying conserved **structural motifs**.
 51 | 
 52 | ## Alignment User Interface
 53 | 
 54 | Before going into the details of how to use the algorithms programmatically, let's take
 55 | a look at the user interface that comes with the *biojava-structure-gui* module.
 56 | 
 57 | ### Pairwise Alignment GUI
 58 | 
 59 | Generating an instance of the GUI is just one line of code:
 60 | 
 61 | ```java
 62 | AlignmentGui.getInstance();
 63 | ```
 64 | 
 65 | This code shows the following user interface:
 66 | 
 67 | ![Alignment GUI](img/alignment_gui.png)
 68 | 
 69 | You can manually select structure chains, domains, or custom files to be aligned.
 70 | Try to align 2hyn vs. 1zll. This will show the results in a graphical way, in
 71 | 3D:
 72 | 
 73 | ![3D Alignment of PDB IDs 2hyn and 1zll](img/2hyn_1zll.png)
 74 | 
 75 | and also in a 2D display that interacts with the 3D display:
 76 | 
 77 | ![2D Alignment of PDB IDs 2hyn and 1zll](img/alignmentpanel.png)
 78 | 
 79 | ### Multiple Alignment GUI
 80 | 
 81 | Because of the inherent difference between multiple and pairwise alignments,
 82 | a separate GUI is used to trigger multiple structural alignments. Generating 
 83 | an instance of the GUI is analogous to the pairwise alignment GUI:
 84 | 
 85 | ```java
 86 | MultipleAlignmentGUI.getInstance();
 87 | ```
 88 | 
 89 | This code shows the following user interface:
 90 | 
 91 | ![Multiple Alignment GUI](img/multiple_gui.png)
 92 | 
 93 | The input is a free-text field in which the structure identifiers are 
 94 | entered, separated by spaces. A **structure identifier** is a String that
 95 | uniquely identifies a structure. It is composed of the PDB ID, the
 96 | chain letters and the ranges of residues of each chain. For the formal description,
 97 | see [StructureIdentifier](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIdentifier.html).
 98 | 
 99 | As an example, a multiple structure alignment of 6 globins is shown here. 
100 | Their structure identifiers are shown in the previous figure of the GUI.
101 | The results are shown in a graphical way, as for the pairwise alignments:
102 | 
103 | ![3D Globin Multiple Alignment](img/multiple_jmol_globins.png)
104 | 
105 | The only difference from the pairwise alignment view is the option to display
106 | only a subset of the structures, by checking the boxes under the 3D
107 | window and then pressing the Show Only button.
108 | 
109 | A **sequence alignment panel** that interacts with the 3D display can also be shown.
110 | 
111 | ![3D Globin Multiple Panel](img/multiple_panel_globins.png)
112 | 
113 | Explore the coloring options in the *Edit* menu, and browse the *View* menu for 
114 | alternative representations of the alignment.
115 | 
116 | The functionality to perform and visualize these alignments can also be
117 | used from your own code. Let's first have a look at the alignment algorithms.
118 | 
119 | ## Pairwise Alignment Algorithms
120 | 
121 | ### Combinatorial Extension (CE)
122 | 
123 | The Combinatorial Extension (CE) algorithm was originally developed by
124 | [Shindyalov and Bourne in
125 | 1998](http://peds.oxfordjournals.org/content/11/9/739.short) [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/9796821).
126 | It works by identifying segments of the two structures with similar local
127 | structure, and then combining those to try to align the most residues possible
128 | while keeping the overall root-mean-square deviation (RMSD) of the superposition low.
129 | 
130 | CE is a rigid-body alignment algorithm, which means that the structures being
131 | compared are kept fixed during superposition. In some cases it may be desirable
132 | to break large proteins up into domains prior to aligning them (by manually
133 | inputting a subrange, using the [SCOP or CATH databases](externaldb.md), or by
134 | decomposing the protein automatically using the [Protein Domain
135 | Parser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/domain/LocalProteinDomainParser.html)
136 | algorithm).
137 | 
138 | BioJava class: [org.biojava.nbio.structure.align.ce.CeMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeMain.html)
139 | 
140 | ### Combinatorial Extension with Circular Permutation (CE-CP)
141 | 
142 | CE and FATCAT both assume that aligned residues occur in the same order in both
143 | structures (i.e. they are both *sequence-order dependent* algorithms). In proteins
144 | related by a circular permutation, the N-terminal part of one protein is related
145 | to the C-terminal part of the other, and vice versa. CE-CP allows circularly
146 | permuted proteins to be compared.  For more information on circular
147 | permutations, see the
148 | [Wikipedia](http://en.wikipedia.org/wiki/Circular_permutation_in_proteins) or
149 | [Molecule of the Month](https://pdb101.rcsb.org/motm/124)
150 | articles [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22496628).
151 | 
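The effect of a circular permutation on residue order can be illustrated with a toy model on residue indices. This is only a sketch of the concept, not of how CE-CP works internally; the class and method names and the chosen cut point are hypothetical.

```java
// Sketch only: a toy model of a circular permutation on residue indices.
final class CircularPermutationSketch {

    // Index that residue i of the original chain has in a chain that
    // starts at residue `cut` of the original and wraps around.
    static int permutedIndex(int i, int cut, int length) {
        return ((i - cut) % length + length) % length;
    }

    // Number of sequentially ordered blocks needed to describe the mapping.
    static int blocksNeeded(int cut, int length) {
        int blocks = 1;
        for (int i = 1; i < length; i++) {
            if (permutedIndex(i, cut, length) < permutedIndex(i - 1, cut, length)) {
                blocks++; // the order breaks here, so a new block starts
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        int length = 10;
        int cut = 6;
        for (int i = 0; i < length; i++) {
            System.out.println(i + " -> " + permutedIndex(i, cut, length));
        }
        System.out.println("blocks: " + blocksNeeded(cut, length));
    }
}
```

In this toy example the N-terminal residues 0 to 5 of the original chain map to the C-terminal indices 4 to 9 of the permuted chain, so the mapping cannot be described by a single sequentially ordered block.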
152 | 
153 | For proteins without a circular permutation, CE-CP results look very similar to
154 | CE results (with perhaps some minor differences and a slightly longer
155 | calculation time). If a circular permutation is found, the two halves of the
156 | proteins will be shown in different colors:
157 | 
158 | ![Concanavalin A (yellow & orange) aligned with Pea Lectin (blue and cyan)](img/3cna.A_2pel.A_cecp.png)
159 | 
160 | CE-CP was developed by Spencer E. Bliven, Philip E. Bourne, and Andreas Prli&#263;.
161 | 
162 | BioJava class: [org.biojava.nbio.structure.align.ce.CeCPMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html)
163 | 
164 | ### FATCAT - rigid
165 | 
166 | This is a Java implementation of the original FATCAT algorithm by [Yuzhen Ye
167 | &amp; Adam Godzik in
168 | 2003](http://bioinformatics.oxfordjournals.org/content/19/suppl_2/ii246.abstract)
169 | [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/14534198).
170 | It performs similarly to CE for most structures. The 'rigid' flavor uses a
171 | rigid-body superposition and only considers alignments with matching sequence
172 | order.
173 | 
174 | BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html)
175 | 
176 | ### FATCAT - flexible
177 | 
178 | FATCAT-flexible introduces 'twists' between different parts of the structures
179 | which are superimposed independently. This is ideal for proteins which undergo
180 | large conformational shifts, where a global superposition cannot capture the
181 | underlying similarity between domains. For instance, the structures of
182 | calmodulin with and without calcium bound can be much better aligned with
183 | FATCAT-flexible than with one of the rigid alignment algorithms. The downside of
184 | this is that it can lead to additional false positives in unrelated structures.
185 | 
186 | ![(Left) Rigid and (Right) flexible alignments of calmodulin](img/1cfd_1cll_fatcat.png)
187 | 
188 | BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html)
189 | 
190 | ### Smith-Waterman
191 | 
192 | This aligns residues based on Smith and Waterman's 1981 algorithm for local
193 | *sequence* alignment [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/7265238). No structural information is included in the alignment, so
194 | this only works for structures with significant sequence similarity. It uses the
195 | Blosum65 scoring matrix.
196 | 
197 | The two structures are superimposed based on this alignment. Be aware that errors
198 | locating gaps can lead to high RMSD in the resulting superposition due to a
199 | small number of badly aligned residues. However, this method is faster than
200 | the structure-based methods.
201 | 
202 | BioJava class: [org.biojava.nbio.structure.align.seq.SmithWaterman3Daligner](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/seq/SmithWaterman3Daligner.html)
203 | 
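The local alignment recurrence behind this method can be sketched in a few lines. The example below is a simplified illustration, assuming a plain match/mismatch score and a linear gap penalty instead of the substitution matrix and gap model that BioJava uses; the scoring constants and the class name are hypothetical.

```java
// Sketch only: match/mismatch scoring with a linear gap penalty,
// not the substitution matrix or gap model used by BioJava.
final class SmithWatermanSketch {

    static final int MATCH = 2;
    static final int MISMATCH = -1;
    static final int GAP = -2;

    // Best local alignment score between the two sequences.
    static int bestLocalScore(String s, String t) {
        int[][] h = new int[s.length() + 1][t.length() + 1];
        int best = 0;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int sub = s.charAt(i - 1) == t.charAt(j - 1) ? MATCH : MISMATCH;
                int score = Math.max(0, h[i - 1][j - 1] + sub); // never below zero
                score = Math.max(score, h[i - 1][j] + GAP);     // gap in t
                score = Math.max(score, h[i][j - 1] + GAP);     // gap in s
                h[i][j] = score;
                best = Math.max(best, score);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(bestLocalScore("MKVLAAGIV", "TTKVLAGGQ"));
    }
}
```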
204 | ### Other methods
205 | 
206 | The following methods are not presented in the user interface by default:
207 | 
208 | * [BioJavaStructureAlignment](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/BioJavaStructureAlignment.html)
209 |   A structure-based alignment method capable of returning multiple alternate
210 |   alignments. It was written by Andreas Prli&#263; and based on the PSC++ algorithm
211 |   provided by Peter Lackner.
212 | * [CeSideChainMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeSideChainMain.html)
213 |   A variant of CE using CB-CB distances, which sometimes improves alignments in
214 |   proteins with parallel sheets and helices.
215 | * [OptimalCECPMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/OptimalCECPMain.html)
216 |   An alternate (much slower) algorithm for finding circular permutations.
217 | 
218 | Additional methods can be added by implementing the
219 | [StructureAlignment](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/StructureAlignment.html)
220 | interface.
221 | 
222 | ## Multiple Structure Alignment
223 | 
224 | This Java implementation for multiple structure alignments, named MultipleMC, is based on the original CE-MC implementation by [Guda C, Scheeff ED, Bourne PE &amp; Shindyalov IN in 2001](http://psb.stanford.edu/psb-online/proceedings/psb01/abstracts/p275.html)
225 | [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/11262947).
226 | 
227 | The idea remains unchanged: perform **all-to-all pairwise alignments** of the structures, choose the 
228 | **reference** as the most similar structure to all others, and run a **Monte Carlo optimization** of
229 | the multiple residue equivalencies (EQRs) to optimize a score function that depends on the inter-residue
230 | distances.
231 | 
232 | However, some details of the implementation have been changed in the BioJava version. 
233 | They are described in the main class and summarized here:
234 | 
235 | 1. It accepts **any pairwise alignment** algorithm (instead of being attached to CE), so any
236 | of the algorithms described before is suitable for generating a seed for optimization. Note that
237 | this property allows *non-topological* and *flexible* multiple structure alignments, subject to
238 | the limitations of the chosen pairwise alignment algorithm.
239 | 2. The **moves** in the Monte Carlo optimization have been reduced to three.
240 | 3. A **new move** to insert and delete individual gaps has been added.
241 | 4. The scoring function has been modified to a **continuous** function, maintaining the properties that the authors described.
242 | 5. The **probability function** is normalized as the optimization progresses, to improve convergence to a maximum score after some random exploration of the multidimensional alignment space.
243 | 
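The general idea of an acceptance probability that tightens as the optimization progresses can be sketched with a generic Metropolis-style rule and a cooling schedule. This is an illustration under stated assumptions, not the probability function of the BioJava optimizer: the linear schedule, the constants and all names are hypothetical.

```java
import java.util.Random;

// Sketch only: a generic Metropolis-style acceptance rule with a cooling
// schedule; the actual probability function of the BioJava optimizer may differ.
final class McAcceptanceSketch {

    // Probability of accepting a move that changes the score by `delta`
    // (positive = better) at the given temperature.
    static double acceptanceProbability(double delta, double temperature) {
        if (delta >= 0) {
            return 1.0; // improvements are always accepted
        }
        return Math.exp(delta / temperature);
    }

    // Linearly decreasing temperature: exploration early, convergence late.
    static double temperature(int step, int maxSteps, double start) {
        return start * (1.0 - (double) step / maxSteps) + 1e-9;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        int maxSteps = 5;
        double delta = -1.0; // a move that lowers the score
        for (int step = 0; step < maxSteps; step++) {
            double t = temperature(step, maxSteps, 2.0);
            double p = acceptanceProbability(delta, t);
            boolean accepted = random.nextDouble() < p;
            System.out.println("step " + step + " p=" + p + " accepted=" + accepted);
        }
    }
}
```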
244 | The algorithm performs similarly to other multiple structure alignment algorithms for most protein families. 
245 | The parameters of both the pairwise aligner and the MC optimization can have an impact on the final result. There is no single best set of parameters, because they usually depend on the specific use case. Thus, trying several parameter combinations, keeping in mind the effect they have on the score function, is good practice when doing any structure alignment.
246 | 
247 | BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html)
248 | 
249 | 
250 | ## Creating Alignments Programmatically
251 | 
252 | The **pairwise structure alignment** algorithms in BioJava implement the
253 | `StructureAlignment` interface, and are usually accessed through
254 | `StructureAlignmentFactory`. Here's an example of how to create a CE-CP
255 | alignment and print some information about it.
256 | 
257 | ```java
258 | // Fetch CA atoms for the structures to be aligned
259 | String name1 = "3cna.A";
260 | String name2 = "2pel";
261 | AtomCache cache = new AtomCache();
262 | Atom[] ca1 = cache.getAtoms(name1);
263 | Atom[] ca2 = cache.getAtoms(name2);
264 | 
265 | // Get StructureAlignment instance
266 | StructureAlignment algorithm  = StructureAlignmentFactory.getAlgorithm(CeCPMain.algorithmName);
267 | 
268 | // Perform the alignment
269 | AFPChain afpChain = algorithm.align(ca1,ca2);
270 | 
271 | // Print text output
272 | System.out.println(afpChain.toCE(ca1,ca2));
273 | ```
274 | 
275 | To display the alignment using Jmol, use:
276 | 
277 | ```java
278 | GuiWrapper.display(afpChain, ca1, ca2);
279 | // Or using the biojava-structure-gui module
280 | StructureAlignmentDisplay.display(afpChain, ca1, ca2);
281 | ```
282 | 
283 | Note that these require that you include the structure-gui package and the Jmol
284 | binary in the classpath at runtime.
285 | 
286 | For creating **multiple structure alignments**, the code is a little bit different, because the
287 | returned data structure and the number of input structures are different. Here is an 
288 | example of how to create and display a multiple alignment:
289 | 
290 | ```java
291 | //Specify the structures to align: some ASP-proteinases
292 | List<String> names = Arrays.asList("3app", "4ape", "5pep", "1psn", "4cms", "1bbs.A", "1smr.A");
293 | 
294 | //Load the CA atoms of the structures and create the structure identifiers
295 | AtomCache cache = new AtomCache();
296 | List<Atom[]> atomArrays = new ArrayList<Atom[]>();
297 | List<StructureIdentifier> identifiers = new ArrayList<StructureIdentifier>();
298 | for (String name:names)	{
299 |   atomArrays.add(cache.getAtoms(name));
300 |   identifiers.add(new SubstructureIdentifier(name));
301 | }
302 | 
303 | //Generate the multiple alignment algorithm with the chosen pairwise algorithm
304 | StructureAlignment pairwise  = StructureAlignmentFactory.getAlgorithm(CeMain.algorithmName);
305 | MultipleMcMain multiple = new MultipleMcMain(pairwise);
306 | 
307 | //Perform the alignment
308 | MultipleAlignment result = multiple.align(atomArrays);
309 | 
310 | // Set the structure identifiers, so that each atom array can be identified in the outputs
311 | result.getEnsemble().setStructureIdentifiers(identifiers);
312 | 
313 | //Output the FASTA sequence alignment
314 | System.out.println(MultipleAlignmentWriter.toFASTA(result));
315 | 
316 | //Display the results in a 3D view
317 | MultipleAlignmentJmolDisplay.display(result);
318 | ```
319 | 
320 | ## Command-Line Tools
321 | 
322 | Many of the alignment algorithms are available in the form of command line
323 | tools. These can be accessed through the main methods of the StructureAlignment
324 | classes.
325 | 
326 | Example:
327 | ```bash
328 | runCE.sh -pdb1 4hhb.A -pdb2 4hhb.B -show3d
329 | ```
330 | 
331 | Using the command line tool it is possible to run pairwise alignments, several
332 | alignments in batch mode, or full database searches. Some additional parameters
333 | are available which are not exposed in the GUI, such as outputting results to a
334 | file in various formats.
335 | 
336 | ## Alignment Data Model
337 | 
338 | For details about the structure alignment data models in BioJava, see [Structure Alignment Data Model](alignment-data-model.md).
339 | 
340 | ## Acknowledgements
341 | 
342 | Thanks to P. Bourne, Yuzhen Ye and A. Godzik for granting permission to freely use and redistribute their algorithms.
343 | 
344 | <!--automatically generated footer-->
345 | 
346 | ---
347 | 
348 | Navigation:
349 | [Home](../README.md)
350 | | [Book 3: The Structure Modules](README.md)
351 | | Chapter 8 : Structure Alignments
352 | 
353 | Prev: [Chapter 7 : SEQRES and ATOM Records](seqres.md)
354 | 
355 | Next: [Chapter 9 : Biological Assemblies](bioassembly.md)
356 | 


--------------------------------------------------------------------------------
/structure/asa.md:
--------------------------------------------------------------------------------
 1 | # Calculating Accessible Surface Areas
 2 | 
 3 | BioJava can also calculate Accessible Surface Areas (ASA) through an implementation of the rolling ball algorithm of Shrake and Rupley [Shrake 1973].
 4 | 
 5 | The following code performs the ASA calculation and prints the per-residue values and the total:
 6 | ```java
 7 | 		AtomCache cache = new AtomCache();
 8 | 		cache.setUseMmCif(true);
 9 | 		
10 | 		StructureIO.setAtomCache(cache); 
11 | 		
12 | 		Structure structure = StructureIO.getStructure("1smt");
13 | 		
14 | 		AsaCalculator asaCalc = new AsaCalculator(structure, 
15 | 				AsaCalculator.DEFAULT_PROBE_SIZE, 
16 | 				1000, 1, false); // sphere points, threads, include HET atoms
17 | 		
18 | 		GroupAsa[] groupAsas = asaCalc.getGroupAsas();
19 | 		
20 | 		double tot = 0;
21 | 		
22 | 		for (GroupAsa groupAsa: groupAsas) {
23 | 			System.out.printf("%1s\t%5s\t%3s\t%6.2f\n", 
24 | 					groupAsa.getGroup().getChainId(),
25 | 					groupAsa.getGroup().getResidueNumber(),
26 | 					groupAsa.getGroup().getPDBName(), 
27 | 					groupAsa.getAsaU());
28 | 			tot+=groupAsa.getAsaU();
29 | 		}
30 | 		
31 | 		System.out.printf("Total area: %9.2f\n",tot);
32 | 		
33 | ```
34 | See [DemoAsa](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoAsa.java) for a fully working demo.
35 | 
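The idea behind the algorithm is to place test points on the surface of each atom, expanded by the probe radius, and to count the points that are not inside any neighbouring expanded atom. The following is a simplified sketch on bare spheres, not the BioJava `AsaCalculator`: the golden-spiral point placement, the 1.4 probe radius, the atom radii and all names are assumptions made for illustration.

```java
// Sketch only: a simplified Shrake-Rupley calculation on bare spheres,
// not the BioJava AsaCalculator.
final class ShrakeRupleySketch {

    // Roughly even points on a unit sphere (golden-spiral construction).
    static double[][] spherePoints(int n) {
        double[][] points = new double[n][3];
        double goldenAngle = Math.PI * (3.0 - Math.sqrt(5.0));
        for (int i = 0; i < n; i++) {
            double z = 1.0 - 2.0 * (i + 0.5) / n;
            double r = Math.sqrt(1.0 - z * z);
            double phi = i * goldenAngle;
            points[i] = new double[] { r * Math.cos(phi), r * Math.sin(phi), z };
        }
        return points;
    }

    // ASA of sphere `index`; each sphere is {x, y, z, radius}.
    static double asa(double[][] spheres, int index, double probe, int nPoints) {
        double[] atom = spheres[index];
        double radius = atom[3] + probe;
        int accessible = 0;
        for (double[] p : spherePoints(nPoints)) {
            double x = atom[0] + radius * p[0];
            double y = atom[1] + radius * p[1];
            double z = atom[2] + radius * p[2];
            boolean buried = false;
            for (int j = 0; j < spheres.length && !buried; j++) {
                if (j == index) {
                    continue; // an atom does not bury its own test points
                }
                double[] other = spheres[j];
                double cutoff = other[3] + probe;
                double dx = x - other[0], dy = y - other[1], dz = z - other[2];
                buried = dx * dx + dy * dy + dz * dz < cutoff * cutoff;
            }
            if (!buried) {
                accessible++;
            }
        }
        // Accessible fraction of the expanded sphere's surface area.
        return 4.0 * Math.PI * radius * radius * accessible / nPoints;
    }

    public static void main(String[] args) {
        double[][] spheres = { {0, 0, 0, 1.7}, {3.0, 0, 0, 1.7} };
        System.out.println(asa(spheres, 0, 1.4, 1000));
    }
}
```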
36 | [Shrake 1973]: http://www.sciencedirect.com/science/article/pii/0022283673900119
37 | 
38 | <!--automatically generated footer-->
39 | 
40 | ---
41 | 
42 | Navigation:
43 | [Home](../README.md)
44 | | [Book 3: The Structure Modules](README.md)
45 | | Chapter 11 : Accessible Surface Areas
46 | 
47 | Prev: [Chapter 10 : External Databases](externaldb.md)
48 | 
49 | Next: [Chapter 12 : Contacts Within a Chain and between Chains](contact-map.md)
50 | 


--------------------------------------------------------------------------------
/structure/bioassembly.md:
--------------------------------------------------------------------------------
  1 | Asymmetric Unit and Biological Assembly
  2 | =======================================
  3 | 
  4 | For many proteins, the asymmetric unit and the biological assembly are the same. However, for quite a few proteins they differ, and depending on what you are interested in, it can be important to work with the biological assembly rather than the asymmetric unit.
  5 | 
  6 | ## Asymmetric Unit
  7 | 
  8 | The asymmetric unit is the smallest portion of a crystal structure to which symmetry operations can be applied in order to generate the complete unit cell (the crystal repeating unit). 
  9 | 
 10 | A crystal asymmetric unit may contain:
 11 | 
 12 | * one biological assembly
 13 | * a portion of a biological assembly
 14 | * multiple biological assemblies
 15 | 
 16 | ## Biological Assembly
 17 | 
 18 | The biological assembly (also sometimes referred to as the biological unit) is the macromolecular assembly that has either been shown to be or is believed to be the functional form of the molecule. For example, the functional form of hemoglobin has four chains.
 19 | 
 20 | The [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) and [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) classes in BioJava provide access methods to work with either the asymmetric unit or the biological assembly.
 21 | 
 22 | Let's load both representations of hemoglobin PDB ID [1HHO](http://www.rcsb.org/pdb/explore.do?structureId=1hho) and visualize them:
 23 | 
 24 | ```java
 25 |     public static void main(String[] args){
 26 | 
 27 |         try {
 28 |             Structure asymUnit = StructureIO.getStructure("1hho");
 29 | 
 30 |             showStructure(asymUnit);
 31 |             
 32 |             Structure bioAssembly = StructureIO.getBiologicalAssembly("1hho");
 33 |             
 34 |             showStructure(bioAssembly);
 35 |             
 36 |         } catch (Exception e){
 37 |             e.printStackTrace();
 38 |         }
 39 | 
 40 |     }
 41 | 
 42 |     public static void showStructure(Structure structure){
 43 | 
 44 |         StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol();
 45 | 
 46 |         jmolPanel.setStructure(structure);
 47 | 
 48 |         // send some commands to Jmol
 49 |         jmolPanel.evalString("select * ; color chain;");            
 50 |         jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on;  ");
 51 |         jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;");
 52 | 
 53 |     }
 54 | ```
 55 | 
 56 | <table>
 57 |     <tr>
 58 |         <td>
 59 |             The <b>asymmetric unit</b> of hemoglobin PDB ID <a href="http://www.rcsb.org/pdb/explore.do?structureId=1hho">1HHO</a>
 60 |         </td>
 61 |         <td>
 62 |             The <b>biological assembly</b> of hemoglobin PDB ID <a href="http://www.rcsb.org/pdb/explore.do?structureId=1hho">1HHO</a>
 63 |         </td>
 64 |     </tr>
 65 |     <tr>
 66 |         <td>
 67 |             <img src="img/1hho_asym.png"/>
 68 |         </td>
 69 |         <td>
 70 |             <img src="img/1hho_biounit.png"/>
 71 |         </td>
 72 |     </tr>
 73 | </table>
 74 | 
 75 | As we can see, the two representations are quite different! When investigating protein interfaces, ligand binding and for many other applications, you always want to work with the biological assemblies.
 76 | 
 77 | Here is another example, the bacteriophage GA protein capsid PDB ID [1GAV](http://www.rcsb.org/pdb/explore.do?structureId=1gav):
 78 | 
 79 | <table>
 80 |     <tr>
 81 |         <td>
 82 |             The <b>asymmetric unit</b> of bacteriophage GA protein capsid PDB ID  <a href="http://www.rcsb.org/pdb/explore.do?structureId=1gav">1GAV</a>
 83 |         </td>
 84 |         <td>
 85 |             The <b>biological assembly</b> of bacteriophage GA protein capsid PDB ID  <a href="http://www.rcsb.org/pdb/explore.do?structureId=1gav">1GAV</a>
 86 |         </td>
 87 |     </tr>
 88 |     <tr>
 89 |         <td>
 90 |             <img src="img/1gav_asym.png"/>
 91 |         </td>
 92 |         <td>
 93 |             <img src="img/1gav_biounit.png"/>
 94 |         </td>
 95 |     </tr>
 96 | </table>
 97 | 
 98 | ## Re-creating Biological Assemblies
 99 | 
100 | Since biological assemblies can be accessed via the StructureIO interface, there is in principle no need to access the lower-level BioJava code that re-creates biological assemblies. If you are interested in the gory details, here are a couple of pointers into the code. There are two ways to obtain a biological assembly:
101 | 
102 | 1. The biological assembly needs to be re-built and the atom coordinates of the asymmetric unit need to be rotated according to the instructions in the files. The information required to re-create the biological assemblies is available in both the PDB and mmCIF/PDBx files. In PDB files the relevant transformations are stored in the *REMARK 350* records. For mmCIF/PDBx, the *_pdbx_struct_assembly* and *_pdbx_struct_oper_list* categories store the corresponding rules.
103 | 
104 | 2. There is also a pre-computed file available from the PDB that contains an assembled version of a structure. This file can be parsed directly, without having to perform rotation operations on coordinates.
105 | 
106 | As of version 5.0 BioJava contains utility classes to re-create biological assemblies for both PDB and mmCIF files.
107 | 
108 | Take a look at the method `getBiologicalAssembly()` in [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html)  to see how the underlying *BiologicalAssemblyBuilder* is called.
109 | 
110 | ## Memory consumption
111 | 
112 | The PBCV-1 virus capsid (PDB ID [1M4X](http://www.rcsb.org/pdb/explore.do?structureId=1m4x)) illustrates how much memory a large biological assembly can require. It consists of 16 million atoms and is one of the largest biological assemblies, if not the largest, currently available in the PDB. To load it you need to raise the maximum heap size: loading requires a minimum of 9GB of RAM (measured on Java 1.7 on OSX). Assuming you have 10GB or more of RAM available on your system, you can change the heap size with the following startup parameter:
113 | <pre>
114 |     -Xmx10G 
115 | </pre>
116 | 
117 | Note: when loading this structure with 9GB of memory, the Java VM spends a significant amount of time in garbage collection (GC). If you provide more RAM than the minimum requirement, then GC is triggered less often and the biological assembly loads faster.
118 | 
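Before attempting to load such a large assembly, a program can check how much heap the JVM was actually given. The following is a small sketch using only the standard `Runtime.maxMemory()` call; the class `HeapCheckSketch` and its helper names are made up for illustration, and the 9GB threshold is simply the figure quoted above.

```java
/** Illustrative heap check (hypothetical helper, not part of BioJava). */
class HeapCheckSketch {

    /** Converts a size in GiB to bytes. */
    static long gib(long n) {
        return n * 1024L * 1024L * 1024L;
    }

    /** True if the maximum heap available to this JVM is at least requiredBytes. */
    static boolean hasHeapFor(long requiredBytes) {
        return Runtime.getRuntime().maxMemory() >= requiredBytes;
    }

    public static void main(String[] args) {
        if (hasHeapFor(gib(9))) {
            System.out.println("Heap looks large enough to try loading 1M4X.");
        } else {
            System.out.println("Heap is below 9GB; restart the JVM with a larger -Xmx value.");
        }
    }
}
```
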
119 | <table>
120 |     <tr>
121 |         <td>
122 |           <img src="img/1m4x_bio_r_250.jpg"/>
123 |         </td>       
124 |     </tr>
125 |     <tr>
126 |         <td>
127 |             The biological assembly of the PBCV-1 virus capsid. (image source: <a href="http://www.rcsb.org/pdb/explore.do?structureId=1m4x">RCSB</a>)
128 |         </td>
129 |     </tr>
130 | </table>
131 | 
132 | ## Representing symmetry related chains
133 | Chains are identified by chain identifiers, which distinguish the different molecular entities present in the asymmetric unit. Once a biological assembly is built, it can be composed of chains from the asymmetric unit and of chains that result from applying a symmetry operator (these chains are also called "symmetry mates"). The problem is that the symmetry mates get the same chain identifiers as the untransformed chains.
134 | 
135 | There are two solutions to this issue:
136 | 
137 | 1. Assign new chain identifiers. In BioJava the new chain identifiers assigned are of the form `<original chain id>_<symmetry operator id>` (the symmetry operator id is numerical and is the one in field `_pdbx_struct_oper_list.id` in the mmCIF file).
138 | 2. Place the symmetry partners into different models. This is the solution taken by the pre-computed biounit files available from the PDB. 
139 | 
140 | Since version 5.0 BioJava uses approach 1) to store the biounit in a single `Structure` object. Because the chain identifiers then have more than one character, the structure can only be written out in mmCIF format (the PDB format is limited to single-character chain identifiers).
141 | 
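As an illustration of approach 1), the sketch below applies one symmetry operator to a coordinate and derives the identifier of the resulting symmetry mate. The class `SymmetryMateSketch`, its method names, and the 3x4 `[R | t]` matrix layout are assumptions made for this example; it does not use BioJava's own assembly builder, only the `<original chain id>_<symmetry operator id>` naming scheme described above.

```java
/** Illustrative symmetry-mate helper (hypothetical class, not part of BioJava). */
class SymmetryMateSketch {

    /**
     * Applies a symmetry operator to a point.
     * op is a 3x4 matrix: columns 0-2 hold the rotation, column 3 the translation.
     */
    static double[] transform(double[][] op, double[] p) {
        double[] q = new double[3];
        for (int row = 0; row < 3; row++) {
            q[row] = op[row][0] * p[0] + op[row][1] * p[1] + op[row][2] * p[2] + op[row][3];
        }
        return q;
    }

    /** Builds the chain identifier of a symmetry mate: original id, underscore, operator id. */
    static String mateChainId(String chainId, int operatorId) {
        return chainId + "_" + operatorId;
    }

    public static void main(String[] args) {
        // made-up operator: 2-fold rotation about z followed by a shift of 10 along x
        double[][] twoFold = { {-1, 0, 0, 10}, {0, -1, 0, 0}, {0, 0, 1, 0} };
        double[] moved = transform(twoFold, new double[] {1, 2, 3});
        System.out.println(mateChainId("A", 2) + ": "
                + moved[0] + " " + moved[1] + " " + moved[2]);
    }
}
```
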
142 | In BioJava one can still produce a biounit using approach 2) by passing a boolean parameter to the `getBiologicalAssembly` method:
143 | ```java
144 | Structure struct = StructureIO.getBiologicalAssembly(pdbId, true);
145 | ```
146 | ## PDB entries with more than one biological assembly
147 | Many PDB entries are assigned more than one biological assembly. This can happen for several reasons: sometimes the authors disagree with the annotators, sometimes the authors are not sure which biological assembly is the right one, and sometimes several equivalent biological assemblies (with slightly different conformations) are present in the asymmetric unit and each is annotated as a separate biological assembly.
148 | 
149 | To get all biological assemblies for a given PDB entry one needs to use:
150 | ```java
151 | List<Structure> bioAssemblies = StructureIO.getBiologicalAssemblies(pdbId);
152 | ```
153 | 
154 | ## Further Reading
155 | 
156 | The RCSB PDB web site has a great [tutorial on Biological Assemblies](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies).
157 | 
158 | <!--automatically generated footer-->
159 | 
160 | ---
161 | 
162 | Navigation:
163 | [Home](../README.md)
164 | | [Book 3: The Structure Modules](README.md)
165 | | Chapter 9 : Biological Assemblies
166 | 
167 | Prev: [Chapter 8 : Structure Alignments](alignment.md)
168 | 
169 | Next: [Chapter 10 : External Databases](externaldb.md)
170 | 


--------------------------------------------------------------------------------
/structure/caching.md:
--------------------------------------------------------------------------------
 1 | Local PDB Installations
 2 | =======================
 3 | 
 4 | BioJava can automatically download and install most of the data files that it needs. Those downloads 
 5 | will happen only once. Future requests for the data file will re-use the local copy.
 6 | 
 7 | The main class that provides this functionality is the [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html).
 8 | 
 9 | It is used behind the scenes by the StructureIO class, which we already encountered earlier.
10 | 
11 | ```java
12 | 	Structure structure = StructureIO.getStructure("4hhb");			
13 | ```
14 | 
15 | is the same as
16 | 
17 | ```java
18 | 	AtomCache cache = new AtomCache();
19 | 	Structure structure = cache.getStructure("4hhb");
20 | ```
21 | 
22 | 
23 | ## Where Are the Files Written to?
24 | 
25 | By default the AtomCache writes all files into a temporary location (the system temp directory, given by the Java system property `java.io.tmpdir`).
26 | 
27 | If you already have a local PDB installation, or you want to use a more permanent location to store the files,
28 | you can configure the AtomCache by setting the `PDB_DIR` system property:
29 | 
30 | <pre>
31 |     -DPDB_DIR=/wherever/you/want/
32 | </pre>
33 | 
34 | BioJava will also check for a `PDB_DIR` environment variable. If you launch BioJava from the command line, it can be useful to include `export PDB_DIR=/wherever/you/want` in your `.bashrc` file.
35 | 
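The lookup described above can be pictured with the following sketch. It uses only standard Java calls; the class `PdbDirSketch` and the exact order of the checks (system property first, then environment variable, then the temp directory) are assumptions for illustration and may differ from what BioJava does internally.

```java
/** Illustrative PDB_DIR lookup (hypothetical class, not BioJava's implementation). */
class PdbDirSketch {

    /** Returns the directory to use: system property, else environment variable, else temp dir. */
    static String resolvePdbDir() {
        String dir = System.getProperty("PDB_DIR");
        if (dir == null || dir.trim().isEmpty()) {
            dir = System.getenv("PDB_DIR");
        }
        if (dir == null || dir.trim().isEmpty()) {
            // fall back to the JVM's temporary directory
            dir = System.getProperty("java.io.tmpdir");
        }
        return dir;
    }

    public static void main(String[] args) {
        System.out.println("PDB files would be stored in: " + resolvePdbDir());
    }
}
```
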
36 | An alternative is to hard-code the path in this way (but setting it as a property is better style)
37 | 
38 | ```java
39 | 	AtomCache cache = new AtomCache();
40 | 
41 | 	cache.setPath("/path/to/pdb/files/");
42 | ```
43 | 
44 | ## File Parsing Parameters
45 | 
46 | The AtomCache also provides access to configuring various options that are available during the 
47 | parsing of files. The [FileParsingParameters](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/FileParsingParameters.html)
48 | class is the main place to influence the level of detail and as a consequence the speed with which files can be loaded.
49 | 
50 | This example retrieves the parsing parameters from an `AtomCache`; set the options you need on `params` before loading a `Structure`. (See also the [next chapter](chemcomp.md), which covers chemical components.)
51 | 
52 | ```java
53 | 	AtomCache cache = new AtomCache();
54 | 
55 | 	cache.setPath("/tmp/");
56 | 
57 | 	FileParsingParameters params = cache.getFileParsingParams();
58 | 
59 | 	StructureIO.setAtomCache(cache);
60 | 
61 | 	Structure structure = StructureIO.getStructure("4hhb");			
62 | 
63 | ```
64 | 
65 | ## Caching of other SCOP, CATH
66 | 
67 | The AtomCache not only provides access to PDB, it can also fetch Structure representations of protein domains, as defined by SCOP and CATH, and the algorithms Protein Domain Parser (PDP) and Domain Parser (DP).
68 | 
69 | ```java
70 | 	// uses a SCOP domain definition
71 | 	Structure domain1 = StructureIO.getStructure("d4hhba_");
72 | 	
73 | 	// Get a specific protein chain, note: chain IDs are case sensitive, PDB IDs are not.
74 | 	Structure chain1 = StructureIO.getStructure("4HHB.A");
75 | 	
76 | ```
77 | 
78 | There are quite a number of external database IDs that are supported here. See the 
79 | <a href="http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html#getStructure(java.lang.String)">AtomCache documentation</a> for more details on the supported options.
80 | 
81 | The non-PDB files can be cached at a different location by setting the `PDB_CACHE_DIR` property (with `java -DPDB_CACHE_DIR=...`) or environment variable.
82 | 
83 | <!--automatically generated footer-->
84 | 
85 | ---
86 | 
87 | Navigation:
88 | [Home](../README.md)
89 | | [Book 3: The Structure Modules](README.md)
90 | | Chapter 4 : Local Installations
91 | 
92 | Prev: [Chapter 3 : Structure Data Model](structure-data-model.md)
93 | 
94 | Next: [Chapter 5 : Chemical Component Dictionary](chemcomp.md)
95 | 


--------------------------------------------------------------------------------
/structure/chemcomp.md:
--------------------------------------------------------------------------------
 1 | The Chemical Component Dictionary
 2 | =================================
 3 | 
 4 | The [Chemical Component Dictionary](http://www.wwpdb.org/ccd.html) is an external reference file describing all residue and small molecule components found in PDB entries. This dictionary contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands, and solvent molecules.
 5 | 
 6 | ### How Does BioJava Decide what Groups Are Amino Acids?
 7 | 
 8 | BioJava uses the Chemical Component Dictionary to achieve a chemically correct representation of each group. To see how this works, let's take a look at how [Selenomethionine](http://en.wikipedia.org/wiki/Selenomethionine) and water are handled:
 9 | 
10 | ```java
11 | Structure structure = StructureIO.getStructure("1A62");
12 | 
13 | for (Chain chain : structure.getChains()){
14 |     for (Group group : chain.getAtomGroups()){
15 |         if ( group.getPDBName().equals("MSE") || group.getPDBName().equals("HOH")){
16 |             System.out.println(group.getPDBName() + " is a group of type " + group.getType());
17 |         }
18 |     }
19 | }
20 | ```
21 | 
22 | This will give this output:
23 | 
24 | <pre>
25 | MSE is a group of type amino
26 | MSE is a group of type amino
27 | MSE is a group of type amino
28 | HOH is a group of type hetatm
29 | HOH is a group of type hetatm
30 | HOH is a group of type hetatm
31 | ...
32 | </pre>
33 | 
34 | As you can see, although MSE is flagged as HETATM in the PDB file, BioJava still represents it correctly as an amino acid. The key is that the [definition file for MSE](http://www.rcsb.org/pdb/files/ligand/MSE.cif) flags it as "L-PEPTIDE LINKING", which BioJava uses to classify the group.
35 | 
36 | Note: Selenomethionine is a naturally occurring amino acid containing selenium. It has the ID MSE in the Chemical Component Dictionary.
37 | 
38 | 
39 | ### How to Access Chemical Component Definitions
40 | 
41 | By default BioJava will retrieve the full chemical component definitions provided by the PDB. That way BioJava makes sure that the user gets a correct representation: for example, it can distinguish ligands from the polypeptide chain and correctly resolve chemically modified residues.
42 | 
43 | The behaviour is configurable by setting a property in the `ChemCompGroupFactory` singleton:
44 | 
45 | 1. Use a minimal built-in set of **Chemical Component Definitions**. This only covers the most frequent chemical components. It does not guarantee a correct representation, but it is fast and does not require network access.
46 | ```java
47 |      ChemCompGroupFactory.setChemCompProvider(new ReducedChemCompProvider());
48 | ```
49 | 2. Load all **Chemical Component Definitions**  at startup (slow startup, but then no further delays later on, requires more memory)
50 | ```java
51 |      ChemCompGroupFactory.setChemCompProvider(new AllChemCompProvider());
52 | ```
53 | 3. Fetch missing **Chemical Component Definitions** on the fly (small download and parsing delays every time a new chemical compound is found). Default behaviour since 4.2.0. Note that the chemical component files are cached in the local file system for subsequent uses.
54 | ```java
55 |      ChemCompGroupFactory.setChemCompProvider(new DownloadChemCompProvider());
56 | ```
57 | 
58 | 
59 | <!--automatically generated footer-->
60 | 
61 | ---
62 | 
63 | Navigation:
64 | [Home](../README.md)
65 | | [Book 3: The Structure Modules](README.md)
66 | | Chapter 5 : Chemical Component Dictionary
67 | 
68 | Prev: [Chapter 4 : Local Installations](caching.md)
69 | 
70 | Next: [Chapter 6 : Work with mmCIF/PDBx Files](mmcif.md)
71 | 


--------------------------------------------------------------------------------
/structure/contact-map.md:
--------------------------------------------------------------------------------
 1 | # Finding contacts between atoms in a protein: contact maps
 2 | 
 3 | Contacts are a useful tool to analyse protein structures. They reduce the three-dimensional view of a structure to a two-dimensional set of contacts between its atoms or its residues. The representation of the contacts in a matrix is known as the contact map. Many protein structure analysis and prediction methods rely on contacts. For instance, they can be useful for:
 4 | 
 5 | + development of structural alignment algorithms [Holm 1993][] [Caprara 2004][]
 6 | + automatic domain identification [Alexandrov 2003][] [Emmert-Streib 2007][]
 7 | + structural modelling by extraction of contact-based empirical potentials [Benkert 2008][]
 8 | + structure prediction via contact prediction from sequence information [Jones 2012][]
 9 | 
10 | ## Getting the contact map of a protein chain
11 | 
12 | This code snippet will produce the set of contacts between all C alpha atoms for chain A of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT):
13 | 
14 | ```java
15 | 		AtomCache cache = new AtomCache();
16 | 		StructureIO.setAtomCache(cache); 
17 | 		
18 | 		Structure structure = StructureIO.getStructure("1SMT");
19 | 			
20 | 		Chain chain = structure.getChainByPDB("A");
21 | 		
22 | 		// we want contacts between Calpha atoms only			
23 | 		String[] atoms = {" CA "};
24 | 		// the distance cutoff we use is 8A
25 | 		AtomContactSet contacts = StructureTools.getAtomsInContact(chain, atoms, 8.0);
26 | 
27 | 		System.out.println("Total number of CA-CA contacts: "+contacts.size());
28 | 
29 | 
30 | ```
31 | 
32 | The contacts are found with spatial hashing rather than by calculating a full distance matrix, so the search scales well with the number of atoms.
33 | 
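The general idea of spatial hashing can be sketched as follows. This is an illustration only, not BioJava's implementation: the class `GridContactSketch`, its method names, and the string cell keys are assumptions. Points are binned into cubic cells whose edge equals the cutoff, so each point only needs to be compared with points in its own cell and the 26 surrounding cells.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative grid-based contact search (hypothetical class, not part of BioJava). */
class GridContactSketch {

    /** Index of the grid cell that contains coordinate v, for cells of the given size. */
    static int cell(double v, double size) {
        return (int) Math.floor(v / size);
    }

    static String key(int ix, int iy, int iz) {
        return ix + "," + iy + "," + iz;
    }

    /** Counts unordered pairs of points that lie within cutoff of each other. */
    static int countContacts(double[][] pts, double cutoff) {
        // 1. hash every point into its grid cell
        Map<String, List<Integer>> grid = new HashMap<>();
        for (int i = 0; i < pts.length; i++) {
            String k = key(cell(pts[i][0], cutoff), cell(pts[i][1], cutoff), cell(pts[i][2], cutoff));
            grid.computeIfAbsent(k, unused -> new ArrayList<>()).add(i);
        }
        // 2. compare each point only with points in the 27 neighbouring cells
        double cutoffSq = cutoff * cutoff;
        int count = 0;
        for (int i = 0; i < pts.length; i++) {
            int cx = cell(pts[i][0], cutoff);
            int cy = cell(pts[i][1], cutoff);
            int cz = cell(pts[i][2], cutoff);
            for (int dx = -1; dx <= 1; dx++) {
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dz = -1; dz <= 1; dz++) {
                        List<Integer> bucket = grid.get(key(cx + dx, cy + dy, cz + dz));
                        if (bucket == null) continue;
                        for (int j : bucket) {
                            if (j <= i) continue; // count each pair once
                            double ex = pts[i][0] - pts[j][0];
                            double ey = pts[i][1] - pts[j][1];
                            double ez = pts[i][2] - pts[j][2];
                            if (ex * ex + ey * ey + ez * ez <= cutoffSq) count++;
                        }
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // five made-up points spaced 3.8 Angstrom apart along the x axis
        double[][] pts = { {0, 0, 0}, {3.8, 0, 0}, {7.6, 0, 0}, {11.4, 0, 0}, {15.2, 0, 0} };
        System.out.println("Contacts within 8 A: " + countContacts(pts, 8.0));
    }
}
```

Because the cell edge equals the cutoff, two points in contact can never be more than one cell apart, which is why checking the neighbouring cells is sufficient.
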
34 | ## Getting the contacts between two protein chains
35 | 
36 | One can also find the contacting atoms between two protein chains. For instance the following code finds the contacts between the first 2 chains of PDB entry [1SMT](http://www.rcsb.org/pdb/explore.do?structureId=1SMT):
37 | 
38 | ```java
39 | 		AtomCache cache = new AtomCache();
40 | 		StructureIO.setAtomCache(cache); 
41 | 		
42 | 		Structure structure = StructureIO.getStructure("1SMT");
43 | 			
44 | 		AtomContactSet contacts = 
45 | 			StructureTools.getAtomsInContact(structure.getChain(0), structure.getChain(1), 5, false);
46 | 		
47 | 		System.out.println("Total number of atom contacts: "+contacts.size());
48 | 		
49 | 		// the list of atom contacts can be reduced to a list of contacts between groups:
50 | 		GroupContactSet groupContacts = new GroupContactSet(contacts);
51 | ```
52 | 
53 | 
54 | See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoContacts.java) for a fully working demo of the examples above.
55 | 
56 | 
57 | 
58 | [Holm 1993]: http://www.biomedcentral.com/pubmed/8377180
59 | [Caprara 2004]: http://www.biomedcentral.com/pubmed/15072687
60 | [Alexandrov 2003]: http://www.biomedcentral.com/pubmed/12584135
61 | [Emmert-Streib 2007]: http://www.biomedcentral.com/pubmed/17608939
62 | [Benkert 2008]: http://www.biomedcentral.com/pubmed/17932912
63 | [Jones 2012]: http://www.ncbi.nlm.nih.gov/pubmed/22101153
64 | 
65 | <!--automatically generated footer-->
66 | 
67 | ---
68 | 
69 | Navigation:
70 | [Home](../README.md)
71 | | [Book 3: The Structure Modules](README.md)
72 | | Chapter 12 : Contacts Within a Chain and between Chains
73 | 
74 | Prev: [Chapter 11 : Accessible Surface Areas](asa.md)
75 | 
76 | Next: [Chapter 13 - Finding all Interfaces in Crystal: Crystal Contacts](crystal-contacts.md)
77 | 


--------------------------------------------------------------------------------
/structure/crystal-contacts.md:
--------------------------------------------------------------------------------
 1 | # How to find all crystal contacts in a PDB structure
 2 | 
 3 | ## Why crystal contacts?
 4 | 
 5 | A protein structure is determined by X-ray diffraction from a protein crystal, i.e. an infinite lattice of molecules. Thus the end result of the diffraction experiment is a crystal lattice and not just a single molecule. However the PDB file only contains the coordinates of the Asymmetric Unit (AU), defined as the minimum unit needed to reconstruct the full crystal using symmetry operators.
 6 | 
 7 | Looking at the AU alone is not enough to understand the crystal structure. For instance, the biologically relevant assembly (known as the Biological Unit) can be formed through a symmetry operator, which can be found by looking at the crystal contacts. See for instance [1M4N](http://www.rcsb.org/pdb/explore.do?structureId=1M4N): its biological unit is a dimer formed through a 2-fold operator, and it is the largest interface found in the crystal.
 8 | 
 9 | Looking at crystal contacts can also be important in order to assess the quality and reliability of the deposited PDB model: an AU can look perfectly fine but then upon reconstruction of the lattice the molecules can be clashing, which indicates that something is wrong in the model.
10 | 
11 | 
12 | ## Getting the set of unique contacts in the crystal lattice
13 | 
14 | This code snippet will produce a list of all non-redundant interfaces present in the crystal lattice of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT):
15 | 
16 | ```java
17 | 		AtomCache cache = new AtomCache();
18 | 		
19 | 		StructureIO.setAtomCache(cache); 
20 | 		
21 | 		Structure structure = StructureIO.getStructure("1SMT");
22 | 			
23 | 		CrystalBuilder cb = new CrystalBuilder(structure);
24 | 		
25 | 		// 6 is the distance cutoff to consider 2 atoms in contact
26 | 		StructureInterfaceList interfaces = cb.getUniqueInterfaces(6);
27 | 		
28 | 		System.out.println("The crystal contains "+interfaces.size()+" unique interfaces");
29 | 
30 | 		// this calculates the buried surface areas of all interfaces and sorts them by areas
31 | 		interfaces.calcAsas(3000, 1, -1);
32 | 
33 | 		// interface ids start at 1; after sorting, id 1 is the largest interface in the crystal
34 | 		double largestArea = interfaces.get(1).getTotalArea();
35 | 
36 | ```
37 | 
38 | An interface is defined here as any 2 chains with at least one pair of atoms within the given distance cutoff (6 Å in the example above).
39 | 
40 | The algorithm to find all unique interfaces in the crystal works roughly like this:
41 | + Reconstructs the full unit cell by applying the matrix operators of the corresponding space group to the Asymmetric Unit.
42 | + Searches all cells around the original one by applying crystal translations. If any 2 chains in that search are found to be in contact, the new contact is added to the final list.
43 | + The search is performed without repeating redundant symmetry operators, making sure that each contact found is unique.
44 | 
45 | See [DemoCrystalInterfaces](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoCrystalInterfaces.java) for a fully working demo of the example above.
46 | 
47 | ## Clustering the interfaces
48 | One can also cluster the interfaces based on their similarity. The similarity is measured through contact overlap: the number of common contacts divided by the average number of contacts in the two interfaces. The clustering can be done as follows:
49 | 
50 | ```java
51 | 		List<StructureInterfaceCluster> clusters = interfaces.getClusters();
52 | 		for (StructureInterfaceCluster cluster:clusters) {
53 | 			System.out.print("Cluster "+cluster.getId()+" members: ");
54 | 			for (StructureInterface member:cluster.getMembers()) {
55 | 				System.out.print(member.getId()+" ");
56 | 			}
57 | 			System.out.println();
58 | 		}
59 | ```
60 | 
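The contact-overlap measure used for clustering can be illustrated with a small sketch. The class `ContactOverlapSketch` and the representation of a contact as a string such as `"A:10-B:55"` are assumptions made for this example; BioJava computes the score on its own interface objects.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative contact-overlap score (hypothetical class, not part of BioJava). */
class ContactOverlapSketch {

    /** Number of common contacts divided by the average number of contacts in the two sets. */
    static double overlap(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // nothing to compare
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        double average = (a.size() + b.size()) / 2.0;
        return common.size() / average;
    }

    public static void main(String[] args) {
        // made-up residue-residue contacts for two interfaces
        Set<String> first = new HashSet<>(Arrays.asList("A:10-B:55", "A:11-B:55", "A:14-B:60"));
        Set<String> second = new HashSet<>(Arrays.asList("A:11-B:55", "A:14-B:60", "A:20-B:61", "A:21-B:61"));
        System.out.printf("Contact overlap: %.3f%n", overlap(first, second));
    }
}
```

A score of 1 means the two interfaces share all their contacts, and a score of 0 means they share none; interfaces whose score exceeds a chosen cutoff would be placed in the same cluster.
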
61 | 
62 | <!--automatically generated footer-->
63 | 
64 | ---
65 | 
66 | Navigation:
67 | [Home](../README.md)
68 | | [Book 3: The Structure Modules](README.md)
69 | | Chapter 13 - Finding all Interfaces in Crystal: Crystal Contacts
70 | 
71 | Prev: [Chapter 12 : Contacts Within a Chain and between Chains](contact-map.md)
72 | 
73 | Next: [Chapter 14 : Protein Symmetry](symmetry.md)
74 | 


--------------------------------------------------------------------------------
/structure/externaldb.md:
--------------------------------------------------------------------------------
  1 | External Databases
  2 | ==================
  3 | 
  4 | BioJava provides access to a number of external structural databases. These often use [caching](caching.md) to reduce the amount of data which must be downloaded from the database.
  5 | 
  6 | SCOP
  7 | ----
  8 | 
  9 | <table>
 10 | <tr><td>
 11 | 	   <img src="img/1dan_scop.png" width=300 />
 12 |     </td>
 13 | <td>
 14 | 	(Top) The structure <a href="http://www.rcsb.org/pdb/explore.do?structureId=1dan">1dan</a> contains four chains. <br/>
 15 | 
 16 |     (Bottom) These chains are broken up into six SCOP domains. The green chain L becomes 3 domains, while a combination of chains U (red) and T (orange) forms the central purple domain.
 17 | </td>
 18 | </tr>
 19 | </table>
 20 | 
 21 | The Structural Classification of Proteins (SCOP) is a manually curated classification of protein structural domains. It provides two pieces of data:
 22 | 
 23 | * The breakdown of a protein into structural domains
 24 | * A classification of domains according to their structure.
 25 | 
 26 | The structure for a known SCOP domain can be fetched via its 7-letter domain ID (e.g. 'd2bq6a1') with ```StructureIO.getStructure()```, as described in [Local PDB Installations](caching.md#caching-of-other-scop-cath).
 27 | 
 28 | The SCOP classification can be accessed through the [```ScopDatabase```](http://www.biojava.org/docs/api/org/biojava/nbio/structure/scop/ScopDatabase.html) class.
 29 | 
 30 | ```java
 31 |     ScopDatabase scop = ScopFactory.getSCOP();
 32 | ```
 33 | 
 34 | ### Inspecting SCOP domains
 35 | 
 36 | A list of domains can be retrieved for a given protein.
 37 | 
 38 | ```java
 39 |     List<ScopDomain> domains = scop.getDomainsForPDB("4HHB");
 40 | ```
 41 | 
 42 | You can get lots of useful information from the [```ScopDomain```](http://www.biojava.org/docs/api/org/biojava/nbio/structure/scop/ScopDomain.html) object. 
 43 | 
 44 |     ScopDomain domain = domains.get(0);
 45 |     String scopID = domain.getScopId(); // d4hhba_
 46 |     String classification = domain.getClassificationId(); // a.1.1.2
 47 |     int sunId = domain.getSunId(); // 15251
 48 | 
 49 | ### Viewing the SCOP hierarchy
 50 | 
 51 | The full hierarchy is available as a tree of [```ScopNode```](http://www.biojava.org/docs/api/org/biojava/nbio/structure/scop/ScopNode.html)s, which can be easily traversed using their ```getParentSunid()``` and ```getChildren()``` methods.
 52 | 
 53 | ```java
 54 |     ScopNode node = scop.getScopNode(sunId);
 55 |     while (node != null){
 56 |         System.out.println(scop.getScopDescriptionBySunid(node.getSunid()));
 57 |         node = scop.getScopNode(node.getParentSunid());
 58 |     }
 59 | ```
 60 | 
 61 | ScopDatabase also provides access to all nodes at a particular level.
 62 | 
 63 | ```java
 64 |     List<ScopDescription> superfams = scop.getByCategory(ScopCategory.Superfamily);
 65 |     System.out.println("Total nr. of superfamilies:" + superfams.size());
 66 | ```
 67 | 
 68 | ### Types of ScopDatabase
 69 | 
 70 | Several types of ```ScopDatabase``` are available. These can be instantiated manually when more control is needed.
 71 | 
 72 | * __RemoteScopInstallation__ (default) Fetches data one node at a time from the internet. Useful when performing a small number of operations.
 73 | * __ScopInstallation__ Downloads all SCOP data as a batch and caches it for later use. Much faster when performing many operations.
 74 | 
 75 | Several internal BioJava classes use ```ScopFactory.getSCOP()``` when they encounter references to SCOP domains, so it is always a good idea to notify the ```ScopFactory``` when using a custom ```ScopDatabase``` instance.
 76 | 
 77 | ```java
 78 |     ScopDatabase scop = new ScopInstallation();
 79 |     ScopFactory.setScopDatabase(scop);
 80 | ```
 81 | Several versions of SCOP are available.
 82 | 
 83 | ```java
 84 |     // Use Steven Brenner's updated version of SCOP
 85 |     scop = ScopFactory.getSCOP(ScopFactory.VERSION_1_75C);
 86 |     // Use an old version globally, perhaps for an older benchmark
 87 |     ScopFactory.setScopDatabase(ScopFactory.VERSION_1_69);
 88 | ```
 89 | 
 90 | CATH
 91 | ----
 92 | 
 93 | CATH can be accessed in a very similar fashion to SCOP. In parallel to the ScopInstallation class, there is a CathInstallation. The StructureIO class also allows structures to be requested by CATH ID.
 94 | 
 95 | ```java
 96 | 
 97 |         private static final String DEFAULT_SCRIPT ="select * ; cartoon on; spacefill off; wireframe off; select ligands; wireframe on; spacefill on;";
 98 |         
 99 |         private static final String[] colors = new String[]{"red","green","blue","yellow"};
100 |     
101 |     public static void main(String args[]){
102 |         
103 |         UserConfiguration config = new UserConfiguration();
104 |         config.setPdbFilePath("/tmp/");
105 | 
106 |         String pdbID = "1DAN";
107 |         
108 |         CathDatabase cath = new CathInstallation(config.getPdbFilePath());
109 |         
110 |         List<CathDomain> domains = cath.getDomainsForPdb(pdbID);
111 |         
112 |         try {
113 |             
114 |             // show the structure in 3D
115 |             BiojavaJmol jmol = new BiojavaJmol();           
116 |             jmol.setStructure(StructureIO.getStructure(pdbID));         
117 |             jmol.evalString(DEFAULT_SCRIPT);
118 |             
119 |             System.out.println("got " + domains.size() + " domains");
120 |             
121 |             // now color the domains on the structure
122 |             int colorpos = -1;
123 |             
124 |             for ( CathDomain domain : domains){             
125 | 
126 |                 colorpos++;
127 |                 
128 |                 showDomain(jmol, domain,colorpos);
129 |             }
130 |                 
131 |             
132 |         } catch (Exception e) {
133 |             // TODO Auto-generated catch block
134 |             e.printStackTrace();
135 |         } 
136 |         
137 |     }
138 | 
139 |     
140 |     
141 |     private static void showDomain(BiojavaJmol jmol, CathDomain domain, int colorpos) {
142 |         List<CathSegment> segments = domain.getSegments();
143 |         
144 |         StructureName key = new StructureName(domain.getDomainName());
145 |         String chainId = key.getChainId();
146 |         
147 |         String color = colors[colorpos % colors.length]; // wrap around if there are more domains than colors
148 |         
149 |         System.out.println(" * domain " + domain.getDomainName() + " has # segments: " + domain.getSegments().size() + " color: " + color);
150 |         
151 |         for ( CathSegment segment : segments){
152 |             System.out.println("   * " + segment);
153 |             String start = segment.getStart();
154 |             
155 |             String stop = segment.getStop();
156 |                         
157 |             String script = "select " + start + "-" + stop+":"+chainId + "; color " + color +";";
158 |             
159 |             jmol.evalString(script );
160 |         }
161 |         
162 |     }
163 |  ```       
164 | 
165 | 
166 | <table>
167 |    <tr>
168 |         <td>
169 |             This will show the following
170 |         </td>
171 | 
172 |         <td>
173 |             and the text:
174 |         </td>
175 |     </tr>
176 |     
177 |     <tr>
178 |         <td>    
179 |             <img src="img/cath_1dan.png" width=300 />
180 |         </td>
181 |         <td>
182 |             <pre>
183 |                    
184 | got 4 domains
185 |  * domain 1danH01 has # segments: 2 color: red
186 |    * CathSegment [segmentId=1, start=16, stop=27, length=12, sequenceHeader=null, sequence=null]
187 |    * CathSegment [segmentId=2, start=121, stop=232, length=112, sequenceHeader=null, sequence=null]
188 |  * domain 1danH02 has # segments: 2 color: green
189 |    * CathSegment [segmentId=1, start=28, stop=120, length=93, sequenceHeader=null, sequence=null]
190 |    * CathSegment [segmentId=2, start=233, stop=246, length=14, sequenceHeader=null, sequence=null]
191 |  * domain 1danU00 has # segments: 1 color: blue
192 |    * CathSegment [segmentId=1, start=91, stop=210, length=120, sequenceHeader=null, sequence=null]
193 |  * domain 1danT00 has # segments: 1 color: yellow
194 |    * CathSegment [segmentId=1, start=6, stop=80, length=75, sequenceHeader=null, sequence=null]
195 |             </pre>
196 |       </td>
197 |     </tr>
198 | </table>
199 | 
200 |    
201 | 
202 | <!--automatically generated footer-->
203 | 
204 | ---
205 | 
206 | Navigation:
207 | [Home](../README.md)
208 | | [Book 3: The Structure Modules](README.md)
209 | | Chapter 10 : External Databases
210 | 
211 | Prev: [Chapter 9 : Biological Assemblies](bioassembly.md)
212 | 
213 | Next: [Chapter 11 : Accessible Surface Areas](asa.md)
214 | 


--------------------------------------------------------------------------------
/structure/firststeps.md:
--------------------------------------------------------------------------------
  1 | First Steps
  2 | ===========
  3 | 
  4 | ## First Steps
  5 | 
  6 | The simplest way to load a PDB file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class.
  7 | 
  8 | ```java
  9 |      public static void main(String[] args) throws Exception {
 10 |            Structure structure = StructureIO.getStructure("4HHB");
 11 |            // and let's print out how many atoms are in this structure
 12 |            System.out.println(StructureTools.getNrAtoms(structure));
 13 |     }   
 14 | ```
 15 | 
 16 | BioJava automatically downloads the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copies it into a temporary location. Then the PDB file parser loads the data into a [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html) object, which provides access to the content of the file. (If you call this a second time, BioJava will automatically re-use the local file.)
 17 | 
 18 | <table>
 19 |     <tr>
 20 |         <td>
 21 |             <a href="http://www.rcsb.org/pdb/explore.do?structureId=4hhb"><img src="img/4hhb_bio_r_250.jpg"/></a>
 22 |         </td>
 23 |         <td>
 24 |             The crystal structure of human deoxyhaemoglobin PDB ID <a href="http://www.rcsb.org/pdb/explore.do?structureId=4hhb">4HHB</a> (image source: <a href="http://www.rcsb.org/pdb/explore.do?structureId=4hhb">RCSB</a>)</td>
 25 |     </tr>
 26 | </table>
 27 | 
 28 | This demonstrates two things:
 29 | 
 30 | + BioJava can automatically download and install files locally (more on this in Chapter 4)
 31 | + BioJava by default writes those files into a temporary location (the system temp directory, given by the Java system property "java.io.tmpdir").
 32 | 
 33 | If you already have a local PDB installation, you can configure where BioJava should read the files from by setting the PDB_DIR system property:
 34 | 
 35 | <pre>
 36 |     -DPDB_DIR=/wherever/you/want/
 37 | </pre>
 38 | 
 39 | ## Memory Consumption
 40 | 
 41 | Speaking of startup properties: many PDB entries are large molecules, and the default maximum heap size of the Java VM (historically 64 MB on older JVMs) is not sufficient in many cases. BioJava contains several built-in caches which automatically adjust to the available memory, so the more memory you grant your Java application, the better it can use the caches and the better the performance will be. Change the maximum heap space of your Java VM with this startup parameter:
 42 | 
 43 | <pre>
 44 |     -Xmx1G
 45 | </pre>
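
To check that the setting took effect, you can ask the running JVM how much heap it is allowed to use. The following is a minimal sketch using only the standard `java.lang.Runtime` API; the class name `HeapCheck` and the helper `toMiB` are made up for this illustration and are not part of BioJava.

```java
class HeapCheck {

    /** Converts a byte count to whole mebibytes (hypothetical helper, illustration only). */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the upper bound the JVM will attempt to use (roughly the -Xmx value)
        System.out.println("Max heap:   " + toMiB(rt.maxMemory()) + " MiB");
        // totalMemory() is the heap currently reserved, freeMemory() the unused part of it
        System.out.println("Total heap: " + toMiB(rt.totalMemory()) + " MiB");
        System.out.println("Free heap:  " + toMiB(rt.freeMemory()) + " MiB");
    }
}
```

Run it once with and once without the `-Xmx` parameter to compare the reported maximum.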
 46 | 
 47 | ## A Quick 3D View
 48 | 
 49 | If you have the *biojava-structure-gui* module installed, you can quickly visualise a [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html) via this:
 50 | 
 51 | ```java
 52 |     public static void main(String[] args) throws Exception {
 53 |         Structure struc = StructureIO.getStructure("4hhb");
 54 | 
 55 |         StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol();
 56 | 
 57 |         jmolPanel.setStructure(struc);
 58 | 
 59 |         // send some commands to Jmol
 60 |         jmolPanel.evalString("select * ; color chain;");            
 61 |         jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on;  ");
 62 |         jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;");         
 63 |     }
 64 | ```
 65 | 
 66 | This will result in the following view:
 67 | 
 68 | <table>
 69 |     <tr>
 70 |         <td>
 71 |             <img src="img/4hhb_jmol.png"/>
 72 |         </td>
 73 |         <td>
 74 |             The <a href="http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/gui/jmol/StructureAlignmentJmol.html">StructureAlignmentJmol</a> class provides a wrapper for the <a href="http://jmol.sourceforge.net/">Jmol</a> viewer and provides a bridge to BioJava, so Structure objects can be sent to Jmol for visualisation.
 75 |         </td>
 76 |     </tr>
 77 | </table>   
 78 | 
 79 | ## Asymmetric Unit and Biological Assembly
 80 | 
 81 | By default, many people work with the *asymmetric unit* of a protein. However, for many studies the correct representation to look at is the *biological assembly* of a protein. You can request it by calling:
 82 | 
 83 | ```java
 84 |      public static void main(String[] args) throws Exception {
 85 |         Structure structure = StructureIO.getBiologicalAssembly("1GAV");
 86 |         // and let's print out how many atoms are in this structure
 87 |         System.out.println(StructureTools.getNrAtoms(structure));
 88 |     }
 89 | ```
 90 | 
 91 | This topic is important, so we dedicated a [whole chapter](bioassembly.md) to it.
 92 | 
 93 | ## I Loaded a Structure Object, What Now?
 94 | 
 95 | BioJava provides a number of algorithms and visualisation tools that you can use to further analyse the structure, or look at it. Here are a couple of suggestions for further reading:
 96 | 
 97 | + [The BioJava Cookbook for protein structures](http://biojava.org/wiki/BioJava:CookBook#Protein_Structure)
 98 | + How does BioJava [represent the content](structure-data-model.md) of a PDB/mmCIF file?
 99 | + How to calculate a protein structure alignment using BioJava: [tutorial](alignment.md) or [cookbook](http://biojava.org/wiki/BioJava:CookBook:PDB:align)
100 | + [How to work with Groups (AminoAcid, Nucleotide, Hetatom)](http://biojava.org/wiki/BioJava:CookBook:PDB:groups)
101 | 
102 | 
103 | 
104 | 
105 | <!--automatically generated footer-->
106 | 
107 | ---
108 | 
109 | Navigation:
110 | [Home](../README.md)
111 | | [Book 3: The Structure Modules](README.md)
112 | | Chapter 2 : First Steps
113 | 
114 | Prev: [Chapter 1 : Installation](installation.md)
115 | 
116 | Next: [Chapter 3 : Structure Data Model](structure-data-model.md)
117 | 


--------------------------------------------------------------------------------
/structure/img/143px-Selenomethionine-from-xtal-3D-balls.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/143px-Selenomethionine-from-xtal-3D-balls.png


--------------------------------------------------------------------------------
/structure/img/1cfd_1cll_fatcat.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1cfd_1cll_fatcat.png


--------------------------------------------------------------------------------
/structure/img/1cfd_1cll_fatcat.xcf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1cfd_1cll_fatcat.xcf


--------------------------------------------------------------------------------
/structure/img/1cfd_1cll_flexible.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1cfd_1cll_flexible.png


--------------------------------------------------------------------------------
/structure/img/1cfd_1cll_rigid.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1cfd_1cll_rigid.png


--------------------------------------------------------------------------------
/structure/img/1dan_scop.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1dan_scop.png


--------------------------------------------------------------------------------
/structure/img/1gav_asym.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1gav_asym.png


--------------------------------------------------------------------------------
/structure/img/1gav_biounit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1gav_biounit.png


--------------------------------------------------------------------------------
/structure/img/1hho_asym.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1hho_asym.png


--------------------------------------------------------------------------------
/structure/img/1hho_biounit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1hho_biounit.png


--------------------------------------------------------------------------------
/structure/img/1m4x_bio_r_250.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1m4x_bio_r_250.jpg


--------------------------------------------------------------------------------
/structure/img/2hyn_1zll.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/2hyn_1zll.png


--------------------------------------------------------------------------------
/structure/img/3cna.A_2pel.A_cecp.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/3cna.A_2pel.A_cecp.png


--------------------------------------------------------------------------------
/structure/img/4hhb_bio_r_250.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/4hhb_bio_r_250.jpg


--------------------------------------------------------------------------------
/structure/img/4hhb_jmol.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/4hhb_jmol.png


--------------------------------------------------------------------------------
/structure/img/alignment_gui.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/alignment_gui.png


--------------------------------------------------------------------------------
/structure/img/alignmentpanel.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/alignmentpanel.png


--------------------------------------------------------------------------------
/structure/img/cath_1dan.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/cath_1dan.png


--------------------------------------------------------------------------------
/structure/img/database_search.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/database_search.png


--------------------------------------------------------------------------------
/structure/img/database_search_results.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/database_search_results.png


--------------------------------------------------------------------------------
/structure/img/multiple_gui.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/multiple_gui.png


--------------------------------------------------------------------------------
/structure/img/multiple_jmol_globins.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/multiple_jmol_globins.png


--------------------------------------------------------------------------------
/structure/img/multiple_panel_globins.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/multiple_panel_globins.png


--------------------------------------------------------------------------------
/structure/img/symm_combined.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_combined.png


--------------------------------------------------------------------------------
/structure/img/symm_helical.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_helical.png


--------------------------------------------------------------------------------
/structure/img/symm_hierarchy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_hierarchy.png


--------------------------------------------------------------------------------
/structure/img/symm_internal.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_internal.png


--------------------------------------------------------------------------------
/structure/img/symm_local.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_local.png


--------------------------------------------------------------------------------
/structure/img/symm_pg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_pg.png


--------------------------------------------------------------------------------
/structure/img/symm_pseudo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_pseudo.png


--------------------------------------------------------------------------------
/structure/img/symm_subunits.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_subunits.png


--------------------------------------------------------------------------------
/structure/installation.md:
--------------------------------------------------------------------------------
 1 | ## Quick Installation
 2 | 
 3 | First, a quick paragraph on how to get access to BioJava.
 4 | 
 5 | BioJava is open source and you can get the code from [GitHub](https://github.com/biojava/biojava). However, it might be easier to use it this way:
 6 | 
 7 | BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html)  guide.
 8 | 
 9 | As of version 4, BioJava is available in Maven Central. This is all you need to add a BioJava dependency to your projects:
10 | 
11 | ```xml
12 |         <dependencies>
13 |                 ...
14 |                 <dependency>
15 |                         <!-- This imports the protein structure module of BioJava (release 4.2.0).
16 |                         -->                        
17 |                         <groupId>org.biojava</groupId>
18 |                         <artifactId>biojava-structure</artifactId>
19 |                         <version>4.2.0</version>
20 |                 </dependency>
21 |                 <!-- if you want to use the visualisation tools you need also this one: -->
22 |                 <dependency>                                         
23 |                         <groupId>org.biojava</groupId>
24 |                         <artifactId>biojava-structure-gui</artifactId>
25 |                         <version>4.2.0</version>
26 |                 </dependency>
27 |                 <!-- other biojava jars as needed -->
28 |         </dependencies> 
29 | ```
30 | 
31 | If you run 
32 | 
33 | <pre>
34 |     mvn package
35 | </pre>
36 | 
37 |  on your project, the BioJava dependencies will be automatically downloaded and installed for you.
38 | 
39 | ### (Optional) Configuration
40 | 
41 | BioJava can be configured through several properties:
42 | 
43 | | Property | Description |
44 | | --- | --- |
45 | | `PDB_DIR` | Directory for caching structure files from the PDB. Mirrors the PDB's FTP server directory structure, with `PDB_DIR` equivalent to ftp://ftp.wwpdb.org/pub/pdb/. Default: temp directory |
46 | | `PDB_CACHE_DIR` | Cache directory for other files related to the structure package. Default: temp directory |
47 | 
48 | These can be set either as Java system properties or as environment variables. For example:
49 | 
50 | ```
51 | # This could be added to .bashrc
52 | export PDB_DIR=...
53 | # Or override for a particular execution
54 | java -DPDB_DIR=... -cp ...
55 | ```
56 | 
57 | Note that your IDE may ignore `.bashrc` settings, but should have a preference for passing VM arguments.
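
To see which values your own program would pick up, you can query both sources yourself. The sketch below uses only standard Java calls (`System.getProperty`, `System.getenv`). The class `ConfigLookup`, its method names, and the lookup order (system property first, then environment variable, then the temp directory) are assumptions made for this illustration; they are not taken from the BioJava source code.

```java
class ConfigLookup {

    /** Returns the first candidate that is neither null nor blank, or the fallback if there is none. */
    static String firstNonEmpty(String fallback, String... candidates) {
        for (String candidate : candidates) {
            if (candidate != null && !candidate.trim().isEmpty()) {
                return candidate;
            }
        }
        return fallback;
    }

    /** Assumed order: Java system property, then environment variable, then the temp directory. */
    static String lookup(String name) {
        return firstNonEmpty(System.getProperty("java.io.tmpdir"),
                System.getProperty(name), System.getenv(name));
    }

    public static void main(String[] args) {
        System.out.println("PDB_DIR       -> " + lookup("PDB_DIR"));
        System.out.println("PDB_CACHE_DIR -> " + lookup("PDB_CACHE_DIR"));
    }
}
```

Running this from your IDE and from a shell is a quick way to find out whether the IDE ignores your `.bashrc` settings.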
58 | 
59 | <!--automatically generated footer-->
60 | 
61 | ---
62 | 
63 | Navigation:
64 | [Home](../README.md)
65 | | [Book 3: The Structure Modules](README.md)
66 | | Chapter 1 : Installation
67 | 
68 | Next: [Chapter 2 : First Steps](firststeps.md)
69 | 


--------------------------------------------------------------------------------
/structure/lists.md:
--------------------------------------------------------------------------------
 1 | # Lists of PDB IDs and PDB Status Information
 2 | 
 3 | ## Get a list of all current PDB IDs
 4 | 
 5 | The following code connects to one of the PDB servers and fetches a list of all current PDB IDs.
 6 | 
 7 | ```java
 8 |     SortedSet<String> currentPDBIds = PDBStatus.getCurrentPDBIds();
 9 | ```     
10 | 
11 | ## The current status of a PDB entry
12 | 
13 | The following provides information about the status of a PDB entry:
14 | 
15 | ```java
16 |     Status status = PDBStatus.getStatus("4hhb");
17 | 
18 |     // get the current ID for an obsolete entry
19 |     String currentID = PDBStatus.getCurrent("1hhb"); 
20 | ```   
21 | 
22 | 
23 | <!--automatically generated footer-->
24 | 
25 | ---
26 | 
27 | Navigation:
28 | [Home](../README.md)
29 | | [Book 3: The Structure Modules](README.md)
30 | | Chapter 18 : Status Information
31 | 
32 | Prev: [Chapter 17 : Special Cases](special.md)
33 | 


--------------------------------------------------------------------------------
/structure/mmcif.md:
--------------------------------------------------------------------------------
  1 | # How to Parse mmCIF Files using BioJava
  2 | 
  3 | A quick tutorial on how to work with mmCIF files.
  4 | 
  5 | ## What is mmCIF?
  6 | 
  7 | The Protein Data Bank (PDB) has been distributing its archival files as PDB files for a long time. The PDB file format is based on "punchcard"-style rules for storing data in a flat file. With the increasing complexity of the macromolecules being resolved experimentally, this file format can no longer represent some of the more complex structures. As such, the wwPDB announced in 2013 the transition from PDB to mmCIF/PDBx as the principal deposition and dissemination file format (see 
  8 | [here](http://www.wwpdb.org/news/news_2013.html#22-May-2013) and 
  9 | [here](http://wwpdb.org/workshop/wgroup.html)). 
 10 | 
 11 | The mmCIF file format has been around for some time (see [Westbrook 2000][] and [Westbrook 2003][]), and [BioJava](http://www.biojava.org) has supported it for several years. This tutorial provides a quick introduction to parsing mmCIF files with [BioJava](http://www.biojava.org).
 12 | 
 13 | ## The Basics
 14 | 
 15 | BioJava uses the [CIFTools-java](https://github.com/rcsb/ciftools-java) library to parse mmCIF. BioJava then reads PDB and mmCIF files 
 16 | into its own biologically and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](chemcomp.md)). 
 17 | If you don't want to use that data model, you can still use the CIFTools-java parser directly; please refer to its documentation. 
 18 | Let's start with the most basic way of loading a protein structure.
 19 | 
 20 | 
 21 | ## First Steps
 22 | 
 23 | The simplest way to load a PDBx/mmCIF file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class.
 24 | 
 25 | ```java
 26 |     Structure structure = StructureIO.getStructure("4HHB");
 27 |     // and let's print out how many atoms are in this structure
 28 |     System.out.println(StructureTools.getNrAtoms(structure));
 29 | ```
 30 | 
 31 | BioJava automatically downloaded the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copied it into a temporary location. This demonstrates two things:
 32 | 
 33 | + BioJava can automatically download and install files locally
 34 | + BioJava by default writes those files into a temporary location (the system temp directory, given by the Java system property "java.io.tmpdir").
 35 | 
 36 | If you already have a local PDB installation, you can configure where BioJava should read the files from by setting the PDB_DIR system property:
 37 | 
 38 | <pre>
 39 |     -DPDB_DIR=/wherever/you/want/
 40 | </pre>
 41 | 
 42 | ## Switching AtomCache to use different file types
 43 | 
 44 | By default, BioJava uses the BCIF file format for parsing data. To switch to mmCIF, we can take control of 
 45 | the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) which 
 46 | manages your PDB ([and also SCOP and CATH](externaldb.md)) installations.
 47 | 
 48 | ```java
 49 |         AtomCache cache = new AtomCache();
 50 | 
 51 |         cache.setFiletype(StructureFiletype.CIF);
 52 |             
 53 |         // if you struggled to set the PDB_DIR property correctly in the previous step, 
 54 |         // you could set it manually like this:
 55 |         cache.setPath("/tmp/");
 56 |             
 57 |         StructureIO.setAtomCache(cache);
 58 |             
 59 |         Structure structure = StructureIO.getStructure("4HHB");
 60 |                     
 61 |         // and let's count how many chains are in this structure.
 62 |         System.out.println(structure.getChains().size());
 63 | ```
 64 | 
 65 | See the `StructureFiletype` enum for the other supported file types.
 66 | 
 67 | ## URL based parsing of files
 68 | 
 69 | StructureIO can also access files via URLs and fetch the data dynamically. For example, the following code shows how to load a file from a remote server.
 70 | 
 71 | ```java
 72 |         String u = "http://ftp.wwpdb.org/pub/pdb/data/biounit/mmCIF/divided/nw/4nwr-assembly1.cif.gz";
 73 |         Structure s = StructureIO.getStructure(u);
 74 |         System.out.println(s);
 75 | ```
 76 | 
 77 | ### Local URLs
 78 | BioJava can also access local files if you specify the URL as:
 79 | 
 80 | <pre>
 81 |     file:///path/to/local/file
 82 | </pre>
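
If you start from an ordinary file system path, the standard `java.nio.file` API can build such a URL for you. This is a minimal sketch; the class `LocalFileUrl`, the method `toFileUrl`, and the example path are hypothetical and only serve as an illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class LocalFileUrl {

    /** Converts a file system path into an absolute file: URL string. */
    static String toFileUrl(String path) {
        Path absolute = Paths.get(path).toAbsolutePath().normalize();
        return absolute.toUri().toString();
    }

    public static void main(String[] args) {
        // hypothetical location of a local structure file
        System.out.println(toFileUrl("/path/to/local/file.cif.gz"));
    }
}
```

The resulting string can then be passed to `StructureIO.getStructure` in the same way as the remote URL in the previous example.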
 83 | 
 84 | 
 85 | ## Low Level Access
 86 | 
 87 | You can load a BioJava `Structure` object using the ciftools-java parser with:
 88 | 
 89 | ```java
 90 |         InputStream inStream =  new FileInputStream(fileName);
 91 |         // now get the protein structure.
 92 |         Structure cifStructure = CifStructureConverter.fromInputStream(inStream);
 93 | ```
 94 | 
 95 | ## I Loaded a Structure Object, What Now?
 96 | 
 97 | BioJava provides a number of algorithms and visualisation tools that you can use to further analyse the structure, or look at it. Here are a couple of suggestions for further reading:
 98 | 
 99 | + [The BioJava Cookbook for protein structures](http://biojava.org/wiki/BioJava:CookBook#Protein_Structure)
100 | + How does BioJava [represent the content](structure-data-model.md) of a PDB/mmCIF file?
101 | + How to calculate a protein structure alignment using BioJava: [tutorial](alignment.md) or [cookbook](http://biojava.org/wiki/BioJava:CookBook:PDB:align)
102 | + [How to work with Groups (AminoAcid, Nucleotide, Hetatom)](http://biojava.org/wiki/BioJava:CookBook:PDB:groups)
103 | 
104 | ## Further reading
105 | 
106 | See the [http://mmcif.rcsb.org/](http://mmcif.rcsb.org/) site for more documentation on mmCIF.
107 | 
108 | 
109 | <!-- References -->
110 | 
111 | 
112 | [Westbrook 2000]: http://www.ncbi.nlm.nih.gov/pubmed/10842738 "Westbrook JD and Bourne PE. STAR/mmCIF: an ontology for macromolecular structure. Bioinformatics 2000 Feb; 16(2) 159-68. pmid:10842738." 
113 | 
114 | [Westbrook 2003]: http://www.ncbi.nlm.nih.gov/pubmed/12647386 "Westbrook JD and Fitzgerald PM. The PDB format, mmCIF, and other data formats. Methods Biochem Anal 2003; 44 161-79. pmid:12647386."
115 | 
116 | 
117 | <!--automatically generated footer-->
118 | 
119 | ---
120 | 
121 | Navigation:
122 | [Home](../README.md)
123 | | [Book 3: The Structure Modules](README.md)
124 | | Chapter 6 : Work with mmCIF/PDBx Files
125 | 
126 | Prev: [Chapter 5 : Chemical Component Dictionary](chemcomp.md)
127 | 
128 | Next: [Chapter 7 : SEQRES and ATOM Records](seqres.md)
129 | 


--------------------------------------------------------------------------------
/structure/secstruc.md:
--------------------------------------------------------------------------------
  1 | Protein Secondary Structure
  2 | ===========================
  3 | 
  4 | ## What is Protein Secondary Structure?
  5 | 
  6 | Protein secondary structure (SS) is the general three-dimensional form of local segments of proteins. 
  7 | Secondary structure can be formally defined by the pattern of hydrogen bonds of the protein 
  8 | (such as alpha helices and beta sheets) that are observed in an atomic-resolution structure. 
  9 | 
 10 | More specifically, the secondary structure is defined by the patterns of hydrogen bonds formed between 
 11 | amine hydrogen (-NH) and carbonyl oxygen (C=O) atoms contained in the backbone peptide bonds of the protein. 
 12 | 
 13 | For more info see the Wikipedia article 
 14 | on [protein secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure).
 15 | 
 16 | ## Secondary Structure Annotation
 17 | 
 18 | ### Information Sources
 19 | 
 20 | There are various ways to obtain the SS annotation of a protein structure:
 21 | 
 22 | - **Authors' assignment**: the authors of the structure describe the SS, usually identifying helices 
 23 | and beta-sheets, and they assign the corresponding type to each residue involved. The authors' assignment
 24 | can be found in the `PDB` and `mmCIF` files deposited in the PDB, and it can be parsed in **BioJava**
 25 | when a `Structure` is loaded.
 26 | 
 27 | - **Assignment from atom coordinates**: various programs exist to assign the SS of a protein. 
 28 | The algorithms use the atom coordinates of the amino acids to determine hydrogen bonds and geometrical patterns 
 29 | that define the different types of protein secondary structure. One of the first and most popular algorithms 
 30 | is `DSSP` (Dictionary of Secondary Structure of Proteins). **BioJava** has an implementation of the algorithm 
 31 | (written originally in C++), which is described later in this chapter.
 32 | 
 33 | - **Prediction from sequence**: other algorithms use only the amino acid sequence (primary structure) of the protein
 34 | and predict the SS using the SS propensities of each amino acid and multiple alignments with homologous sequences 
 35 | (e.g. [PSIPRED](http://bioinf.cs.ucl.ac.uk/psipred/)). At the moment **BioJava** does not have an implementation 
 36 | of this type, which would be more suitable for the sequence and alignment modules.
 37 | 
 38 | ### Secondary Structure Types
 39 | 
 40 | Following the `DSSP` convention, **BioJava** defines 8 types of secondary structure:
 41 | 
 42 |     E = extended strand, participates in β ladder
 43 |     B = residue in isolated β-bridge
 44 |     H = α-helix
 45 |     G = 3-helix (3-10 helix)
 46 |     I = 5-helix (π-helix)
 47 |     T = hydrogen bonded turn
 48 |     S = bend
 49 |     _ = loop (any other type)
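
The eight one-letter codes above can be modelled as a small enum. The following is a minimal, self-contained sketch for illustration only: the name `SsType`, its constants and its helper methods are assumptions made for this example and are not BioJava's own classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-in for the eight DSSP-style types listed above (hypothetical, not a BioJava class). */
enum SsType {
    EXTENDED('E', "extended strand, participates in beta ladder"),
    BRIDGE('B', "residue in isolated beta-bridge"),
    HELIX_4('H', "alpha-helix"),
    HELIX_3('G', "3-helix (3-10 helix)"),
    HELIX_5('I', "5-helix (pi-helix)"),
    TURN('T', "hydrogen bonded turn"),
    BEND('S', "bend"),
    COIL('_', "loop (any other type)");

    final char code;
    final String description;

    // Lookup table from one-letter code to type, filled once after the constants exist.
    private static final Map<Character, SsType> BY_CODE = new HashMap<>();
    static {
        for (SsType t : values()) {
            BY_CODE.put(t.code, t);
        }
    }

    SsType(char code, String description) {
        this.code = code;
        this.description = description;
    }

    /** Returns the type for a one-letter code; blanks and unknown codes are treated as loop. */
    static SsType fromCode(char c) {
        SsType t = BY_CODE.get(c);
        return t != null ? t : COIL;
    }

    /** True for the three helical types (H, G, I). */
    boolean isHelix() {
        return this == HELIX_4 || this == HELIX_3 || this == HELIX_5;
    }
}
```

Treating blanks as loops is a choice made here because the textual outputs shown later in this chapter print a space for residues without a specific type.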
 50 | 
 51 | ## Parsing Secondary Structure in BioJava
 52 | 
 53 | Currently there are two alternatives for parsing the secondary structure in **BioJava**: either from the PDB/mmCIF
 54 | files of deposited structures (author assignment) or from the output file of a DSSP prediction. Both file types
 55 | can be obtained from the PDB servers, if available, so they can be automatically fetched by BioJava. 
 56 | 
 57 | As an example, here are the links for the structure **5PTI** to its 
 58 | [PDB file](http://www.rcsb.org/pdb/files/5PTI.pdb) (search for the HELIX and SHEET lines) and its 
 59 | [DSSP file](http://www.rcsb.org/pdb/files/5PTI.dssp).
 60 | 
 61 | Note that the DSSP prediction output is more detailed and complete than the authors' assignment. 
 62 | The choice of one or the other will depend on the use case. 
 63 | 
 64 | Below you can find some examples of how to parse and assign the SS of a `Structure`:
 65 | 
 66 | ```java
 67 |     String pdbID = "5pti";
 68 |     FileParsingParameters params = new FileParsingParameters();
 69 |     //Only change needed to the normal Structure loading
 70 |     params.setParseSecStruc(true); //this is false as DEFAULT
 71 | 
 72 |     AtomCache cache = new AtomCache();
 73 |     cache.setFileParsingParams(params);
 74 | 
 75 |     //The loaded Structure contains the SS assigned
 76 |     Structure s = cache.getStructure(pdbID);
 77 |     
 78 |     //If the more detailed DSSP prediction is required call this afterwards
 79 |     DSSPParser.fetch(pdbID, s, true); //Second parameter true overrides the previous SS
 80 | ```
 81 | 
 82 | For more examples search in the **demo** package for `DemoLoadSecStruc`.
 83 | 
 84 | ## Assignment of Secondary Structure in BioJava
 85 | 
 86 | ### Algorithm
 87 | 
 88 | The algorithm implemented in BioJava for the assignment of SS is `DSSP`. It is described in the paper from 
 89 | [Kabsch W. & Sander C. in 1983](http://onlinelibrary.wiley.com/doi/10.1002/bip.360221211/abstract) 
 90 | [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/6667333).
 91 | A brief explanation of the algorithm and the output format can be found
 92 | [here](http://swift.cmbi.ru.nl/gv/dssp/DSSP_3.html).
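
To give a feeling for the hydrogen-bond criterion at the heart of DSSP, here is a minimal sketch of the electrostatic energy term as it is commonly summarized from the Kabsch and Sander paper: partial charges of 0.42e and 0.20e, a dimensional factor of 332, and a cutoff of -0.5 kcal/mol. The class and method names are hypothetical, and this is not necessarily how `SecStrucCalc` is coded internally.

```java
/** Sketch of the Kabsch-Sander hydrogen-bond energy (hypothetical helper, not BioJava API). */
final class HBondEnergy {

    static final double Q1 = 0.42;     // partial charge on C=O, in units of e
    static final double Q2 = 0.20;     // partial charge on N-H, in units of e
    static final double F = 332.0;     // dimensional factor, gives kcal/mol for distances in Angstrom
    static final double CUTOFF = -0.5; // kcal/mol; lower energies count as a hydrogen bond

    private HBondEnergy() { }

    /**
     * Energy between a backbone C=O group and an N-H group.
     * All arguments are inter-atomic distances in Angstrom:
     * rON = O..N, rCH = C..H, rOH = O..H, rCN = C..N.
     */
    static double energy(double rON, double rCH, double rOH, double rCN) {
        return Q1 * Q2 * F * (1.0 / rON + 1.0 / rCH - 1.0 / rOH - 1.0 / rCN);
    }

    /** True if the energy is below the cutoff. */
    static boolean isHydrogenBond(double rON, double rCH, double rOH, double rCN) {
        return energy(rON, rCH, rOH, rCN) < CUTOFF;
    }
}
```

The helix, turn and bridge patterns are then derived from which residue pairs satisfy this criterion.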
 93 | 
 94 | The interface is simple: a single method, named *calculate()*, calculates the SS and can assign it to the
 95 | input Structure, overriding any previous annotation, as in the DSSPParser. An example can be found below:
 96 | 
 97 | ```java
 98 |     String pdbID = "5pti";
 99 |     AtomCache cache = new AtomCache();
100 |     
101 |     //Load structure without any SS assignment
102 |     Structure s = cache.getStructure(pdbID);
103 |         
104 |     //Predict and assign the SS of the Structure
105 |     SecStrucCalc ssp = new SecStrucCalc(); //Instantiation needed
106 |     ssp.calculate(s, true); //true assigns the SS to the Structure
107 | ```
108 | 
109 | BioJava Class: 
110 | [org.biojava.nbio.structure.secstruc.SecStrucCalc](http://www.biojava.org/docs/api/org/biojava/nbio/structure/secstruc/SecStrucCalc.html)
111 | 
112 | ### Storage and Data Structures
113 | 
114 | Because there are different sources of SS annotation, the data structure in **BioJava** that stores SS assignments 
115 | has two levels. The top level `SecStrucInfo` is very general and only contains two properties: **assignment**
116 | (a String describing the source of information) and **type** (the SS type).
117 | 
118 | However, there is an extended container `SecStrucState`, which is a subclass of `SecStrucInfo`, that stores
119 | all the information of the hydrogen bonding, turns, bends, etc. used for the SS prediction and present in the
120 | DSSP output file format. This information is only used in certain applications, and that is the reason for the
121 | more general `SecStrucInfo` class being used by default.
122 | 
123 | In order to access the SS information of a `Structure`, the `SecStrucInfo` object needs to be obtained from the
124 | `Group` properties. Below you find an example of how to access and print residue by residue the SS information of 
125 | a `Structure`:
126 | 
127 | ```java
128 |     //This structure should have SS assigned (by any of the methods described)
129 |     Structure s;
130 | 
131 |     for (Chain c : s.getChains()) {
132 |         for (Group g: c.getAtomGroups()){
133 |             if (g.hasAminoAtoms()){ //Only AA store SS
134 |                 //Obtain the object that stores the SS
135 |                 SecStrucInfo ss = (SecStrucInfo) g.getProperty(Group.SEC_STRUC);
136 |                 //Print information: chain+resn+name+SS
137 |                 System.out.println(c.getChainID()+" "+
138 |                     g.getResidueNumber()+" "+
139 |                     g.getPDBName()+" -> "+ss);
140 |             }
141 |         }
142 |     }
143 | ```
144 | 
145 | ### Output Formats
146 | 
147 | Once the SS has been assigned (either loaded or calculated), there are some easy formats to visualize it in **BioJava**:
148 | 
149 | - **DSSP format**: the SS can be printed in the DSSP output file format, following the standards so that it can be
150 | parsed again. It is the safest way to serialize a SS annotation and recover it later, but it is probably the most 
151 | complicated to visualize.
152 | 
153 | <pre>
154 |   #  RESIDUE AA STRUCTURE BP1 BP2  ACC     N-H-->O    O-->H-N    N-H-->O    O-->H-N    TCO  KAPPA ALPHA  PHI   PSI    X-CA   Y-CA   Z-CA 
155 |     1    1 A R              0   0  168      0, 0.0    54,-0.1     0, 0.0     5,-0.1   0.000 360.0 360.0 360.0 139.2   32.2   14.7  -11.8
156 |     2    2 A P    >   -     0   0   45      0, 0.0     3,-1.8     0, 0.0     4,-0.3  -0.194 360.0-122.0 -61.4 144.9   34.9   13.6   -9.4
157 |     3    3 A D  G >  S+     0   0  122      1,-0.3     3,-1.6     2,-0.2     4,-0.2   0.790 108.3  71.4 -62.8 -28.5   35.8   10.0   -9.5
158 |     4    4 A F  G >  S+     0   0   26      1,-0.3     3,-1.7     2,-0.2    -1,-0.3   0.725  83.7  70.4 -64.1 -23.3   35.0    9.7   -5.9
159 | </pre>
160 | 
161 | - **FASTA format**: simple format that prints the SS type of each residue sequentially in the order of the amino acids.
162 | It is the easiest to visualize, but the least informative of all.
163 | 
164 | <pre>
165 | >5PTI_SS-annotation
166 |   GGGGS     S    EEEEEEETTTTEEEEEEE SSS  SS BSSHHHHHHHH   
167 | </pre>
168 | 
169 | - **Helix Summary**: similar to the FASTA format, but it also contains information about the helical turns.
170 | 
171 | <pre>
172 | 3 turn:  >>><<<                                                   
173 | 4 turn:                        >444<                  >>>>XX<<<<  
174 | 5 turn:                        >5555<                             
175 | SS:       GGGGS     S    EEEEEEETTTTEEEEEEE SSS  SS BSSHHHHHHHH   
176 | AA:     RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA
177 | </pre>
178 | 
179 | - **Secondary Structure Elements**: another way to visualize the SS annotation is by compacting sequential residues that share the same SS type and assigning an ID to the range. In this way, a structure can be described by
180 | a collection of helices, strands, turns, etc. and each one of the elements can be identified by an ID (e.g. helix 1 (H1), beta-strand 6 (E6), etc).
181 | 
182 | <pre>
183 | G1: 3 - 6
184 | S1: 7 - 7
185 | S2: 13 - 13
186 | E1: 18 - 24
187 | T1: 25 - 28
188 | E2: 29 - 35
189 | S3: 37 - 39
190 | S4: 42 - 43
191 | B1: 45 - 45
192 | S5: 46 - 47
193 | H1: 48 - 55
194 | </pre>
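
The compaction into secondary structure elements can be illustrated with plain Java. The sketch below is an assumption-laden illustration, not the BioJava implementation: the class `SsElements` and its method are hypothetical, it works on a FASTA-style SS string, and it numbers residues by their sequential position (starting at 1) rather than by PDB residue number.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: compacts a FASTA-style SS string into labelled ranges (hypothetical helper). */
final class SsElements {

    private SsElements() { }

    /**
     * Groups runs of identical SS characters into elements such as "G1: 3 - 6".
     * Blanks and '_' (loops) are skipped; IDs are counted separately per SS type.
     */
    static List<String> compact(String ss) {
        List<String> elements = new ArrayList<>();
        Map<Character, Integer> counters = new HashMap<>();
        int i = 0;
        while (i < ss.length()) {
            char type = ss.charAt(i);
            int start = i;
            // advance to the end of the current run of identical characters
            while (i < ss.length() && ss.charAt(i) == type) {
                i++;
            }
            if (type == ' ' || type == '_') {
                continue; // loop residues are not reported as elements
            }
            int id = counters.merge(type, 1, Integer::sum);
            // start + 1 and i are the 1-based first and last positions of the run
            elements.add(type + "" + id + ": " + (start + 1) + " - " + i);
        }
        return elements;
    }
}
```

Calling `SsElements.compact(...)` on the `SS:` line of the helix summary should give a listing in the same style as the one shown above, provided the sequential positions coincide with the residue numbers.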
195 | 
196 | You can find examples of how to get the different file formats in the class `DemoSecStrucPred` in the **demo**
197 | package.
198 | 
199 | ### Example
200 | 
201 | Use the following Maven dependencies:
202 | 
203 | ```xml
204 | <dependency>
205 |     <groupId>org.biojava</groupId>
206 |     <artifactId>biojava-core</artifactId>
207 |     <version>4.2.4</version>
208 | </dependency>
209 | <dependency>
210 |     <groupId>org.biojava</groupId>
211 |     <artifactId>biojava-structure</artifactId>
212 |     <version>4.2.4</version>
213 | </dependency>
214 | ```
215 | 
216 | This is taken from the DemoLoadSecStruc example in the **demo** package.
217 | 
218 | ```java
219 | import java.io.IOException;
220 | import org.biojava.nbio.structure.Structure;
221 | import org.biojava.nbio.structure.StructureException;
222 | import org.biojava.nbio.structure.align.util.AtomCache;
223 | import org.biojava.nbio.structure.io.FileParsingParameters;
224 | import org.biojava.nbio.structure.secstruc.DSSPParser;
225 | import org.biojava.nbio.structure.secstruc.SecStrucCalc;
226 | import org.biojava.nbio.structure.secstruc.SecStrucInfo;
227 | import org.biojava.nbio.structure.secstruc.SecStrucTools;
228 | import java.util.List;
229 | public static void main(String[] args) throws IOException,
230 | 			StructureException {
231 | 
232 | 		String pdbID = "5pti";
233 | 
234 | 		// Only change needed to the DEFAULT Structure loading
235 | 		FileParsingParameters params = new FileParsingParameters();
236 | 		params.setParseSecStruc(true);
237 | 
238 | 		AtomCache cache = new AtomCache();
239 | 		cache.setFileParsingParams(params);
240 | 
241 | 		// Use PDB format, because SS cannot be parsed from mmCIF yet
242 | 		cache.setUseMmCif(false);
243 | 
244 | 		// The loaded Structure contains the SS assigned by Author (simple)
245 | 		Structure s = cache.getStructure(pdbID);
246 | 
247 | 		// Print the Author's assignment (from PDB file)
248 | 		System.out.println("Author's assignment: ");
249 | 		printSecStruc(s);
250 | 
251 | 		// If the more detailed DSSP prediction is required call this
252 | 		DSSPParser.fetch(pdbID, s, true);
253 | 
254 | 		// Print the assignment residue by residue
255 | 		System.out.println("DSSP assignment: ");
256 | 		printSecStruc(s);
257 | 
258 | 		// finally use BioJava's built in DSSP-like secondary structure assigner
259 | 		SecStrucCalc secStrucCalc = new SecStrucCalc();
260 | 
261 | 		// calculate and assign
262 | 		secStrucCalc.calculate(s,true);
263 | 		printSecStruc(s);
264 | 
265 | 	}
266 | 
267 | 	public static void printSecStruc(Structure s){
268 | 		List<SecStrucInfo> ssi = SecStrucTools.getSecStrucInfo(s);
269 | 		for (SecStrucInfo ss : ssi) {
270 | 			System.out.println(ss.getGroup().getChain().getName() + " "
271 | 					+ ss.getGroup().getResidueNumber() + " "
272 | 					+ ss.getGroup().getPDBName() + " -> " + ss.toString());
273 | 		}
274 | 	}
275 | ```
276 | 
277 | 
278 | <!--automatically generated footer-->
279 | 
280 | ---
281 | 
282 | Navigation:
283 | [Home](../README.md)
284 | | [Book 3: The Structure Modules](README.md)
285 | | Chapter 15 : Protein Secondary Structure
286 | 
287 | Prev: [Chapter 14 : Protein Symmetry](symmetry.md)
288 | 
289 | Next: [Chapter 17 : Special Cases](special.md)
290 | 


--------------------------------------------------------------------------------
/structure/seqres.md:
--------------------------------------------------------------------------------
  1 | SEQRES and ATOM Records, Mapping to Uniprot (SIFTs)
  2 | ===================================================
  3 | 
  4 | How molecular sequences are linked to experimentally observed atoms.
  5 | 
  6 | ## Sequences and Atoms
  7 | 
  8 | In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB often contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments).
  9 | 
 10 | Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of how the regions that have been observed in an experiment and are available in the PDB map to UniProt.
 11 | 
 12 | ![Screenshot of Protein Feature View at RCSB](https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)")
 13 | 
 14 | As you can see, there are three PDB entries (PDB IDs [3LOH](http://www.rcsb.org/pdb/explore.do?structureId=3LOH), [2HR7](http://www.rcsb.org/pdb/explore.do?structureId=2HR7), [3BU3](http://www.rcsb.org/pdb/explore.do?structureId=3BU3)) that cover different regions of the UniProt sequence for the insulin receptor.
 15 | 
 16 | The blue boxes are regions for which atom records are available. For the grey regions there is sequence information available in the PDB, but no coordinates.
 17 | 
 18 | ## Seqres and Atom Records
 19 | 
 20 | The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequence as can be found in UniProt, since it can contain cloning artefacts and modifications that were necessary in order to crystallize a structure.
 21 | 
 22 | The **Atom** records provide coordinates where it was possible to observe them.
 23 | 
 24 | <pre>
 25 |     Seqres groups -> sequence that has been used in the experiment
 26 |     Atom groups   -> subset of Seqres groups for which coordinates could be obtained
 27 | </pre>    
 28 | 
 29 | The *mmCIF/PDBx* file format contains the information about how the Seqres and Atom records are mapped onto each other. However, the *PDB format* does not clearly specify how to resolve this mapping. BioJava contains a utility class that maps the Seqres to the Atom records when parsing PDB files. This class performs an alignment using dynamic programming, which can slow down the parsing process. If you do not require the precise Seqres to Atom mapping, you can turn it off like this:
 30 | 
 31 | ```java
 32 |     AtomCache cache = new AtomCache();
 33 |             
 34 |     FileParsingParameters params = cache.getFileParsingParams();
 35 |             
 36 |     params.setAlignSeqRes(false);
 37 |     StructureIO.setAtomCache(cache); // register the cache so StructureIO uses these parameters
 38 |     Structure structure = StructureIO.getStructure(...);
 39 |             
 40 | ```
 41 | 
 42 | ## Accessing Seqres and Atom Groups
 43 | 
 44 | By default BioJava loads both the Seqres and Atom groups into the [Chain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Chain.html) 
 45 | objects.
 46 | 
 47 | <pre>
 48 |     Chain   -> Seqres groups
 49 |             -> Atom groups
 50 | </pre>
 51 | 
 52 | Groups that are part of the Seqres sequence as well as of the Atom records are mapped onto each other. This means you
 53 | can iterate over all Seqres groups in a chain and check whether they have observed atoms.
 54 | 
 55 | ## Mapping from Uniprot to Atom Records 
 56 | 
 57 | The mapping between PDB and UniProt changes over time, due to the dynamic nature of biological data. The [PDBe](http://www.pdbe.org) has a project that provides up-to-date mappings between the two databases, the [SIFTs](http://www.ebi.ac.uk/pdbe/docs/sifts/) project. 
 58 | 
 59 | BioJava contains a parser for the SIFTs XML files. The [SiftsMappingProvider](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/sifts/SiftsMappingProvider.html) works similarly to the AtomCache class that we [discussed earlier](caching.md): it can automatically download and locally install SIFTs files.
 60 | 
 61 | Here is how to request the mapping for one particular PDB ID:
 62 | 
 63 | ```java
 64 |     List<SiftsEntity> entities = SiftsMappingProvider.getSiftsMapping("1gc1");
 65 |             
 66 |     for (SiftsEntity e : entities){
 67 |         System.out.println(e.getEntityId() + " " +e.getType());
 68 |         
 69 |         for ( SiftsSegment seg: e.getSegments()) {
 70 |             System.out.println(" Segment: " + seg.getSegId() + " " + seg.getStart() + " " + seg.getEnd()) ;
 71 |             
 72 |             for ( SiftsResidue res: seg.getResidues() ) {
 73 |                 System.out.println("  " + res);
 74 |             }
 75 |         }
 76 |         
 77 |     }
 78 | ```
 79 | 
 80 | This gives the following output:
 81 | 
 82 | <pre>
 83 |     C protein
 84 |  Segment: 1gc1_C_1_181 1 181
 85 |   SiftsResidue [pdbResNum=1, pdbResName=LYS, chainId=C, uniProtResName=K, uniProtPos=26, naturalPos=1, seqResName=LYS, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false]
 86 |   SiftsResidue [pdbResNum=2, pdbResName=LYS, chainId=C, uniProtResName=K, uniProtPos=27, naturalPos=2, seqResName=LYS, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false]
 87 |   SiftsResidue [pdbResNum=3, pdbResName=VAL, chainId=C, uniProtResName=V, uniProtPos=28, naturalPos=3, seqResName=VAL, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false]
 88 |   SiftsResidue [pdbResNum=4, pdbResName=VAL, chainId=C, uniProtResName=V, uniProtPos=29, naturalPos=4, seqResName=VAL, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false]
 89 |   SiftsResidue [pdbResNum=5, pdbResName=LEU, chainId=C, uniProtResName=L, uniProtPos=30, naturalPos=5, seqResName=LEU, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false]
 90 |   SiftsResidue [pdbResNum=6, pdbResName=GLY, chainId=C, uniProtResName=G, uniProtPos=31, naturalPos=6, seqResName=GLY, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false]
 91 |   SiftsResidue [pdbResNum=7, pdbResName=LYS, chainId=C, uniProtResName=K, uniProtPos=32, naturalPos=7, seqResName=LYS, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false]
 92 |   ...
 93 |  </pre>   
 94 | 
 95 | As you can see, for each residue in the UniProt / PDB sequence the matching counterpart is provided (if there is one).
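
Once the per-residue records are available, a common next step is to look up the UniProt position for a given PDB residue. The following is a minimal, library-independent sketch of such a lookup table. The class `PdbToUniProt` and its methods are hypothetical; in a real program the entries would be filled from the `pdbResNum`, `chainId` and `uniProtPos` values of the parsed residues, as printed in the output above.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a PDB residue to UniProt position lookup (hypothetical helper, not BioJava API). */
final class PdbToUniProt {

    // key: chain ID and PDB residue number (kept as a String, since it may carry an insertion code)
    private final Map<String, Integer> uniProtPosByResidue = new HashMap<>();

    private static String key(String chainId, String pdbResNum) {
        return chainId + ":" + pdbResNum;
    }

    /** Registers one mapped residue. */
    void add(String chainId, String pdbResNum, int uniProtPos) {
        uniProtPosByResidue.put(key(chainId, pdbResNum), uniProtPos);
    }

    /** Returns the UniProt position, or null if the residue has no UniProt counterpart. */
    Integer uniProtPos(String chainId, String pdbResNum) {
        return uniProtPosByResidue.get(key(chainId, pdbResNum));
    }
}
```

For example, after `add("C", "1", 26)`, which corresponds to the first residue of the output above, `uniProtPos("C", "1")` returns 26, while an unregistered residue returns `null`.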
 96 | 
 97 | 
 98 | 
 99 | <!--automatically generated footer-->
100 | 
101 | ---
102 | 
103 | Navigation:
104 | [Home](../README.md)
105 | | [Book 3: The Structure Modules](README.md)
106 | | Chapter 7 : SEQRES and ATOM Records
107 | 
108 | Prev: [Chapter 6 : Work with mmCIF/PDBx Files](mmcif.md)
109 | 
110 | Next: [Chapter 8 : Structure Alignments](alignment.md)
111 | 


--------------------------------------------------------------------------------
/structure/special.md:
--------------------------------------------------------------------------------
  1 | # Special Cases When Working with Protein Structures
  2 | 
  3 | ## Alternate Locations
  4 | 
  5 | Some PDB entries contain alternate conformations for parts of a structure or a group. BioJava merges alternate conformations into a single group, for which alternative groups are available.
  6 | 
  7 | ```java
  8 | 			
  9 | 			Structure s = StructureIO.getStructure("1AAC");
 10 | 
 11 | 			Chain a = s.getChainByPDB("A");
 12 | 
 13 | 			Group g = a.getGroupByPDB( ResidueNumber.fromString("27"));
 14 | 
 15 | 			System.out.println(g);
 16 | 			for (Atom atom : g.getAtoms()) {
 17 | 				System.out.print(atom.toPDB());
 18 | 			}
 19 | 			
 20 | 			
 21 | 			int pos = 0;
 22 | 			for (Group alt: g.getAltLocs()) {
 23 | 				pos++;
 24 | 				System.out.println("altLoc: " + pos + " " + alt);
 25 | 				for (Atom atom : alt.getAtoms()) {
 26 | 					System.out.print(atom.toPDB());
 27 | 				}
 28 | 			} 
 29 | ```			
 30 | 
 31 | ## Insertion Codes
 32 | 
 33 | Insertion codes were introduced in the PDB when people wanted to compare the "same" protein between different species. As it turned out, the "same" protein did not have exactly the same sequence in different species, and in some cases insertions were found, resulting in longer sequences. For the comparison of the proteins it was considered important to preserve the numbering, so that one could say, for example, that "HIS 75" is an important residue. To make up for the mismatch in the lengths of the sequences, insertion codes were introduced. As a consequence, a particular residue in the PDB is identified uniquely by three data items: chain identifier, residue number, and insertion code. 
 34 | 
 35 | BioJava contains the ResidueNumber object to help with characterizing each group in a file. PDB ID 1IGY contains some extra residues around chain B position 82. BioJava can represent these like this:
 36 | 
 37 | ```java
 38 | 			Structure s1 = StructureIO.getStructure("1IGY");
 39 | 			
 40 | 			Chain b = s1.getChainByPDB("B");
 41 | 			
 42 | 			for (Group g : b.getAtomGroups()){
 43 | 				System.out.println(g.getResidueNumber() + " " + g.getPDBName() + " " + g.getResidueNumber().getInsCode());
 44 | 			}
 45 | 			
 46 | ```
 47 | 
 48 | This will display the following table: (residuenumber, name, insertion code)
 49 | 
 50 | ```
 51 | 		...
 52 | 			81 HIS null
 53 | 			82 LEU null
 54 | 			82A SER A
 55 | 			82B SER B
 56 | 			82C LEU C
 57 | 			83 THR null
 58 | 			84 SER null
 59 | 		...	
 60 | ```
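
To make the idea of "residue number plus insertion code" concrete, here is a small self-contained sketch that splits strings such as `82A` into their two parts and orders them. The class `ResNum` is a hypothetical stand-in written for this example and is not BioJava's `ResidueNumber`. It assumes a single trailing letter as insertion code and sorts insertion codes alphabetically; in real files the order of the residues in the chain is authoritative.

```java
/** Sketch of a residue number with optional insertion code (hypothetical, not BioJava's ResidueNumber). */
final class ResNum implements Comparable<ResNum> {

    final int seqNum;
    final Character insCode; // null when the residue has no insertion code

    ResNum(int seqNum, Character insCode) {
        this.seqNum = seqNum;
        this.insCode = insCode;
    }

    /** Parses strings such as "82", "82A" or "-3". */
    static ResNum parse(String s) {
        String t = s.trim();
        char last = t.charAt(t.length() - 1);
        if (Character.isLetter(last)) {
            return new ResNum(Integer.parseInt(t.substring(0, t.length() - 1)), last);
        }
        return new ResNum(Integer.parseInt(t), null);
    }

    /** Orders by number first; a residue without insertion code sorts before one with a code. */
    @Override
    public int compareTo(ResNum other) {
        if (seqNum != other.seqNum) {
            return Integer.compare(seqNum, other.seqNum);
        }
        if (insCode == null) {
            return other.insCode == null ? 0 : -1;
        }
        if (other.insCode == null) {
            return 1;
        }
        return insCode.compareTo(other.insCode);
    }

    @Override
    public String toString() {
        return insCode == null ? String.valueOf(seqNum) : seqNum + "" + insCode;
    }
}
```

With this ordering, the residues from the table above sort as 82, 82A, 82B, 82C, 83.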
 61 | 
 62 | 
 63 | ## Chromophores
 64 | 
 65 | A [chromophore](http://en.wikipedia.org/wiki/Chromophore) is the part of a molecule responsible for its color. Some proteins, such as GFP, contain a chromophore that consists of three modified residues. BioJava represents this as a single group in terms of atoms, but as three amino acids when creating the amino acid sequence.
 66 | 
 67 | ```java
 68 | 			
 69 | 						
 70 | 			// make sure we download chemical component definitions
 71 | 			// which is required for correctly representing the chromophore
 72 | 			FileParsingParameters params = new FileParsingParameters();			
 73 | 			params.setLoadChemCompInfo(true);						
 74 | 			
 75 | 			// now register the parameters in the cache
 76 | 			AtomCache cache = new AtomCache();			
 77 | 			cache.setFileParsingParams(params);						
 78 | 			StructureIO.setAtomCache(cache);
 79 | 			
 80 | 			
 81 | 			// request a GFP protein
 82 | 			Structure s1 = StructureIO.getStructure("2pxw");
 83 | 			
 84 | 			// and print out the internals
 85 | 			System.out.println(s1.getPDBHeader().toPDB());
 86 | 						
 87 | 			// chromophore is at PDB residue number 66
 88 | 			for ( Chain c : s1.getChains()) {
 89 | 			
 90 | 				System.out.println("Chain " + c.getChainID() + 
 91 | 						" internal " + c.getInternalChainID() +
 92 | 						" ligands " + c.getAtomLigands().size());
 93 | 				System.out.println("         10        20        30        40        50        60");
 94 | 				System.out.println("1234567890123456789012345678901234567890123456789012345678901234567890");
 95 | 				System.out.println(c.getAtomSequence());
 96 | 				
 97 | 				int pos = 0 ;
 98 | 				for (Group g: c.getAtomGroups()) {
 99 | 					pos++;					
100 | 					System.out.println(pos + " " + g.getResidueNumber() + " " + g.getPDBName() + " " + g.getType()  + " " + g.getChemComp().getOne_letter_code() + " " + g.getChemComp().getType() );									
101 | 				}				
102 | 			}
103 | ```
104 | 
105 | This will give the following output; note 'DYG' at position 63.
106 | 
107 | ```		
108 |            60
109 | 		...01234567890
110 | 		...AAFDYGNRVFTEY...
111 | ```
112 | 
113 | DYG is an unusual group: `getOne_letter_code()` returns three characters for it.
114 | 
115 | ```
116 | 	...
117 | 		62 65 PHE amino F L-PEPTIDE LINKING
118 | 		63 66 DYG amino DYG L-PEPTIDE LINKING
119 | 		64 69 ASN amino N L-PEPTIDE LINKING
120 | 	...
121 | ```
122 | 
123 | ## Microheterogeneity
124 | 
125 | 
126 | 
127 | <!--automatically generated footer-->
128 | 
129 | ---
130 | 
131 | Navigation:
132 | [Home](../README.md)
133 | | [Book 3: The Structure Modules](README.md)
134 | | Chapter 17 : Special Cases
135 | 
136 | Prev: [Chapter 15 : Protein Secondary Structure](secstruc.md)
137 | 
138 | Next: [Chapter 18 : Status Information](lists.md)
139 | 


--------------------------------------------------------------------------------
/structure/structure-data-model.md:
--------------------------------------------------------------------------------
  1 | # The BioJava-Structure Data Model
  2 | 
  3 | A biologically and chemically meaningful data representation of PDB/mmCIF.
  4 | 
  5 | ## The Basics   
  6 | 
  7 | BioJava at its core is a collection of file parsers and (in some cases) data models to represent frequently used biological data. The protein-structure modules represent macromolecular data in a way that should make it easy to work with. The representation is essentially independent of the underlying file format, and the user can choose to work with either PDB or mmCIF files and still get an almost identical data representation. (There can be subtle differences between PDB and mmCIF data; for example, the atom indices in a few entries are not 100% identical.)
  8 | 
  9 | ## The Main Hierarchy
 10 | 
 11 | BioJava provides a flexible data structure for managing protein structural data. The 
 12 | [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html) class is the main container. 
 13 | 
 14 | A `Structure` has a hierarchy of sub-objects:
 15 | 
 16 | <pre>
 17 | Structure 
 18 |    |
 19 |    Model(s)
 20 |         |
 21 |         Chain(s)
 22 |             |
 23 |              Group(s) -> Chemical Component Definition
 24 |                  |
 25 |                  Atom(s)
 26 | </pre>
 27 | 
 28 | All `Structure` objects contain one or more `Models`. This means that X-ray structures also contain a "virtual" model, which serves as a container for the chains. This makes it possible to represent multi-model X-ray structures, e.g. from time-series analysis. The most common way to access chains is via:
 29 | 
 30 | ```java
 31 |         List<Chain> chains = structure.getChains();
 32 | ```
 33 | 
 34 | This works for both NMR and X-ray based structures, and by default the first `Model` is accessed.
 35 | 
 36 | ## Working with Atoms
 37 | 
 38 | There are different ways to access the data contained in a [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html).
 39 | If you want to directly access an array of representative [Atoms](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Atom.html) (CA for proteins, P for nucleotides), you can use the utility class called [StructureTools](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureTools.html):
 40 | 
 41 | ```java
 42 |     // get all representative atoms in the structure, one for residue
 43 |     Atom[] caAtoms = StructureTools.getRepresentativeAtomArray(structure);
 44 | ```
 45 | 
 46 | Alternatively you can access atoms also by their parent-group.
 47 | 
 48 | ## Loop over All the Data
 49 | 
 50 | Here is an example that loops over the whole data model and prints out the HEM groups of hemoglobin:
 51 | 
 52 | ```java
 53 | 			Structure structure = StructureIO.getStructure("4hhb");			
 54 | 
 55 | 			List<Chain> chains = structure.getChains();
 56 | 
 57 | 			System.out.println(" # chains: " + chains.size());
 58 | 
 59 | 			for (Chain c : chains) {
 60 | 				
 61 | 				System.out.println("   Chain: " + c.getId() + " # groups with atoms: " + c.getAtomGroups().size());
 62 | 
 63 | 				for (Group g: c.getAtomGroups()){
 64 | 
 65 | 					if ( g.getPDBName().equalsIgnoreCase("HEM")) {
 66 | 
 67 | 						System.out.println("   " + g);
 68 | 
 69 | 						for (Atom a: g.getAtoms()) {
 70 | 
 71 | 							System.out.println("    " + a);
 72 | 
 73 | 						}
 74 | 					}
 75 | 				}
 76 | 			}
 77 | ```
 78 | 
 79 | ## Working with Groups
 80 | 
 81 | The [Group](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Group.html) interface defines all methods common to a group of atoms. There are 3 types of Groups:
 82 | 
 83 | * [AminoAcid](http://www.biojava.org/docs/api4.2.1/org/biojava/nbio/structure/AminoAcid.html)
 84 | * [Nucleotide](http://www.biojava.org/docs/api4.2.1/org/biojava/nbio/structure/NucleotideImpl.html) 
 85 | * [Hetatom](http://www.biojava.org/docs/api4.2.1/org/biojava/nbio/structure/HetatomImpl.html) 
 86 | 
 87 | In order to get all amino acids that have been observed in a PDB chain, you can use the following utility method:
 88 | 
 89 | ```java
 90 |             Chain chain = structure.getPolyChainByPDB("A");
 91 |             List<Group> groups = chain.getAtomGroups(GroupType.AMINOACID);
 92 |             for (Group group : groups) {
 93 |                 SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC);
 94 | 
 95 |                 // print the secondary structure assignment
 96 |                 System.out.println(group + " -- " + secStrucInfo);
 97 |             }
 98 | ```
 99 | 
100 | In a similar way you can access all nucleotide groups by
101 | ```java
102 |             chain.getAtomGroups(GroupType.NUCLEOTIDE);
103 | ```
104 | 
105 | The Hetatom groups are accessed in a similar fashion:
106 | ```java
107 |             chain.getAtomGroups(GroupType.HETATM);
108 | ```
109 | 
110 | 
111 | Since all 3 types of groups are implementing the Group interface, you can also iterate over all groups and check for the instance type:
112 | 
113 | ```java
114 |             List<Group> allgroups = chain.getAtomGroups();
115 |             for (Group group : allgroups) {
116 |                 if (group.isAminoAcid()) {
117 |                     SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC);
118 |                     System.out.println(group + " -- " + secStrucInfo);
119 |                 }
120 |             }
121 | ```
122 | 
123 | ## A Note
124 | 
125 | The detection of the groups works really well in connection with the [Chemical Component Dictionary](chemcomp.md), as we will discuss in a later chapter. Without this dictionary, there can be inconsistencies, in particular with chemically modified residues.
126 | 
127 | ## Entities and Chains
128 | 
129 | Entities are the distinct chemical components of structures in the PDB. 
130 | Unlike chains, entities do not include duplicate copies and each entity is different from every other 
131 | entity in the structure. There are different types of entities. Polymer entities include Protein, DNA, 
132 | and RNA. Ligands are smaller chemical components that are not part of a polymer entity. 
133 | 
134 | <pre>
135 | 	Structure -> Entity -> Chain
136 | </pre>
137 | 
138 | To explain this with an example, hemoglobin (e.g. PDB ID 4HHB) has two components, alpha 
139 | and beta. Each of the entities has two copies (= chains) in the structure. In 4HHB, alpha 
140 | has the two chains with the IDs A and C, and beta has the chains B and D. In total, hemoglobin is 
141 | built up out of four chains.
142 | 
143 | This prints all the entities in a structure:
144 | ```java
145 | 			Structure structure = StructureIO.getStructure("4hhb");			
146 | 
147 | 			System.out.println(structure);
148 | 						
149 | 			System.out.println(" # of compounds (entities) " + structure.getEntityInfos().size());
150 | 
151 | 			for ( EntityInfo entity: structure.getEntityInfos()) {
152 | 				System.out.println("   " + entity);
153 | 			}
154 | ```
155 | 
156 | 
157 | 
158 | 
159 | 
160 | 
161 | 
162 | <!--automatically generated footer-->
163 | 
164 | ---
165 | 
166 | Navigation:
167 | [Home](../README.md)
168 | | [Book 3: The Structure Modules](README.md)
169 | | Chapter 3 : Structure Data Model
170 | 
171 | Prev: [Chapter 2 : First Steps](firststeps.md)
172 | 
173 | Next: [Chapter 4 : Local Installations](caching.md)
174 | 


--------------------------------------------------------------------------------
/structure/symmetry.md:
--------------------------------------------------------------------------------
  1 | Protein Symmetry using BioJava
  2 | ================================================================
  3 | 
  4 | BioJava can be used to detect, analyze, and visualize **symmetry** and
  5 | **pseudo-symmetry** in the **quaternary** (biological assembly) and tertiary
  6 | (**internal**) structural levels of proteins.
  7 | 
  8 | ## Quaternary Symmetry
  9 | 
 10 | The **quaternary symmetry** of a structure defines the relation and arrangement of the individual chains or groups of chains that are part of a biological assembly. 
 11 | For a more exhaustive explanation of protein quaternary symmetry and its different types, visit the [PDB help page](http://www.rcsb.org/pdb/staticHelp.do?p=help/viewers/jmol_symmetry_view.html).
 12 | 
 13 | In the **quaternary symmetry** detection problem, we are given a set of chains (subunits) that are part of a biological assembly as input, defined by their atomic coordinates, and we are required to find the highest overall symmetry group that
 14 | relates them as output. 
 15 | The solution is divided into the following steps:
 16 | 
 17 | 1. First, we need to identify the chains that are identical (or similar
 18 | in the pseudo-symmetry case). For that purpose, we perform a pairwise alignment of all
 19 | chains and identify **clusters of identical or similar subunits**.
 20 | 2. Next, we reduce each polypeptide chain to a single point, its **centroid** (center of mass).
 21 | 3. Afterwards, we try different **symmetry operations** using a grid search to superimpose the chain centroids
 22 | and score them using the RMSD.
 23 | 4. Finally, based on the parameters (cutoffs), we determine the **overall symmetry** of the
 24 | structure, with the symmetry relations obtained in the previous step.
 25 | 5. If the structure is asymmetric, we combinatorially discard a number of chains and try
 26 | to detect any **local symmetries** present (symmetry that does not involve all subunits of the biological assembly).
 27 | 
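Steps 2 and 3 above can be sketched in plain Java, independent of BioJava. This is a simplified illustration under stated assumptions, not the library's implementation: the class `CentroidSymmetrySketch` and its methods are hypothetical, the centroid is an unweighted mean, the trial operation is restricted to a rotation about the z axis through the origin, and each rotated centroid is matched to its nearest original centroid rather than through a strict one-to-one permutation. For three centroids placed 120° apart around the z axis, a rotation by 2π/3 gives an RMSD close to zero, while a rotation by π/2 does not.

```java
/** Illustration only: scores a trial rotation by the RMSD between subunit centroids. */
final class CentroidSymmetrySketch {

    /** Centroid (unweighted mean position) of a set of atoms given as {x, y, z} rows. */
    static double[] centroid(double[][] atoms) {
        double[] c = new double[3];
        for (double[] a : atoms) {
            for (int i = 0; i < 3; i++) c[i] += a[i];
        }
        for (int i = 0; i < 3; i++) c[i] /= atoms.length;
        return c;
    }

    /** Rotates a point about the z axis (through the origin) by the given angle in radians. */
    static double[] rotateZ(double[] p, double angle) {
        double cos = Math.cos(angle);
        double sin = Math.sin(angle);
        return new double[] { cos * p[0] - sin * p[1], sin * p[0] + cos * p[1], p[2] };
    }

    /**
     * RMSD between the rotated centroids and the original ones, matching each
     * rotated centroid to its nearest original centroid (a simplification).
     */
    static double rmsdAfterRotation(double[][] centroids, double angle) {
        double sumSq = 0;
        for (double[] c : centroids) {
            double[] r = rotateZ(c, angle);
            double best = Double.MAX_VALUE;
            for (double[] o : centroids) {
                double d2 = 0;
                for (int i = 0; i < 3; i++) d2 += (r[i] - o[i]) * (r[i] - o[i]);
                best = Math.min(best, d2);
            }
            sumSq += best;
        }
        return Math.sqrt(sumSq / centroids.length);
    }
}
```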
 28 | The **quaternary symmetry** detection algorithm is implemented in the BioJava class
 29 | [QuatSymmetryDetector](http://www.biojava.org/docs/api/org/biojava/nbio/structure/symmetry/core/QuatSymmetryDetector).
 30 | An example of how to use it programmatically is shown below:
 31 | 
 32 | ```java
 33 | // First load the structure in its biological assembly form (4HHB is only an example ID)
 34 | Structure s = StructureIO.getBiologicalAssembly("4hhb");
 35 | 
 36 | // Set some parameters if needed different than DEFAULT - see descriptions
 37 | QuatSymmetryParameters parameters = new QuatSymmetryParameters();
 38 | SubunitClustererParameters clusterParams = new SubunitClustererParameters();
 39 | 
 40 | // The detector is not instantiated: the static methods of QuatSymmetryDetector
 41 | // perform the calculation.
 42 | // NOTE: method names may differ between BioJava versions; check the API docs
 43 | // of the release you are using.
 44 | QuatSymmetryResults globalResults = QuatSymmetryDetector.getGlobalSymmetry(s, parameters, clusterParams);
 45 | List<QuatSymmetryResults> localResults = QuatSymmetryDetector.getLocalSymmetries(s, parameters, clusterParams);
 46 | 
 47 | ```
 48 | See also the [demo](https://github.com/biojava/biojava/blob/885600670be75b7f6bc5216bff52a93f43fff09e/biojava-structure/src/main/java/demo/DemoSymmetry.java#L37-L59) provided in **BioJava** for a complete working example.
 49 | 
 50 | The returned `QuatSymmetryResults` object contains all the information about the subunit clustering and the structural symmetry.
 51 | This object is used later to obtain the axes of symmetry, the point group name, and the stoichiometry,
 52 | or to display the results in Jmol.
 53 | If the structure is asymmetric, the result is a C1 point group.
 54 | Local symmetry is returned as a `List` because there can be several valid local symmetries.
 55 | The list is empty if the structure has no local symmetries.
 56 | 
 57 | 
 58 | ### Global Symmetry
 59 | 
 60 | In the **global symmetry** mode all chains have to be part of the symmetry result.
 61 | 
 62 | #### Point Group
 63 | 
 64 | In a **point group**, one or more rotation axes define the overall symmetry
 65 | operations, with the property that all the axes intersect at a single point.
 66 | 
 67 | ![PDB ID 1VYM](img/symm_pg.png)
 68 | 
 69 | #### Helical
 70 | 
 71 | In **helical** symmetry there is a single axis with rotation and translation
 72 | components.
 73 | 
 74 | ![PDB ID 4UDV](img/symm_helical.png)
 75 | 
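A helical (screw) operation combines a rotation about an axis with a translation along it. The sketch below illustrates this in plain Java, assuming for simplicity that the helix axis is the z axis; the class `HelicalSymmetrySketch` and its methods are hypothetical and not part of BioJava. Applying the operation repeatedly to one subunit position generates the positions of the following subunits along the helix.

```java
/** Illustration only: a screw operation about the z axis (rotation plus translation). */
final class HelicalSymmetrySketch {

    /** Rotates p about the z axis by angle (radians), then shifts it by rise along z. */
    static double[] screw(double[] p, double angle, double rise) {
        double cos = Math.cos(angle);
        double sin = Math.sin(angle);
        return new double[] { cos * p[0] - sin * p[1], sin * p[0] + cos * p[1], p[2] + rise };
    }

    /** Generates n subunit positions by repeatedly applying the screw to the start position. */
    static double[][] generate(double[] start, double angle, double rise, int n) {
        double[][] positions = new double[n][];
        if (n == 0) return positions;
        positions[0] = start.clone();
        for (int i = 1; i < n; i++) {
            positions[i] = screw(positions[i - 1], angle, rise);
        }
        return positions;
    }
}
```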
 76 | ### Local Symmetry
 77 | 
 78 | In **local symmetry** a number of chains is left out, so that the symmetry only applies to a subset of chains.
 79 | 
 80 | ![PDB ID 4F88](img/symm_local.png)
 81 | 
 82 | ### Pseudo-Symmetry
 83 | 
 84 | In **pseudo-symmetry** the chains related by the symmetry are not completely
 85 | identical, but they share a sequence or structural similarity above the pseudo-symmetry
 86 | similarity threshold.
 87 | 
 88 | If we consider hemoglobin, at a 95% sequence identity threshold the alpha and
 89 | beta subunits are considered different, which corresponds to an A2B2 stoichiometry
 90 | and a C2 point group. At the structural similarity level, all four chains are
 91 | considered homologous (~45% sequence identity), giving an A4 pseudostoichiometry and
 92 | D2 pseudosymmetry.
 93 | 
 94 | ![PDB ID 4HHB](img/symm_pseudo.png)
 95 | 
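The effect of the similarity threshold on the stoichiometry can be sketched with a small greedy clustering in plain Java. This is a simplified illustration, not BioJava's subunit clusterer: the class `StoichiometrySketch` is hypothetical, the input is assumed to be a precomputed pairwise identity matrix, and a single identity threshold stands in for both the sequence and the structural similarity criteria. With the hemoglobin values quoted above (identical copies at 1.0, alpha versus beta at about 0.45), a threshold of 0.95 gives `A2B2`, and a threshold of 0.40 gives `A4`.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: derives a stoichiometry string from pairwise identities. */
final class StoichiometrySketch {

    /**
     * Greedy clustering: each chain joins the first cluster whose representative
     * it matches with identity >= threshold, otherwise it starts a new cluster.
     * Cluster sizes are reported largest first as A<n>B<m>... (at most 26 clusters).
     */
    static String stoichiometry(double[][] identity, double threshold) {
        List<Integer> representatives = new ArrayList<>();
        List<Integer> sizes = new ArrayList<>();
        for (int chain = 0; chain < identity.length; chain++) {
            boolean assigned = false;
            for (int c = 0; c < representatives.size() && !assigned; c++) {
                if (identity[chain][representatives.get(c)] >= threshold) {
                    sizes.set(c, sizes.get(c) + 1);
                    assigned = true;
                }
            }
            if (!assigned) {
                representatives.add(chain);
                sizes.add(1);
            }
        }
        sizes.sort((a, b) -> b - a); // largest cluster first
        StringBuilder sb = new StringBuilder();
        for (int c = 0; c < sizes.size(); c++) {
            sb.append((char) ('A' + c)).append(sizes.get(c));
        }
        return sb.toString();
    }
}
```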
 96 | ## Internal Symmetry
 97 | 
 98 | **Internal symmetry** refers to the symmetry present in a single chain, that is,
 99 | the tertiary structure. The algorithm implemented in BioJava to detect internal
100 | symmetry is called **CE-Symm**.
101 | 
102 | ### CE-Symm
103 | 
104 | The **CE-Symm** algorithm was originally developed by [Myers-Turnbull D., Bliven SE.,
105 | Rose PW., Aziz ZK., Youkharibache P., Bourne PE. & Prlić A. in 2014](http://www.sciencedirect.com/science/article/pii/S0022283614001557)
106 | [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/24681267).
107 | As its name suggests, **CE-Symm** uses the Combinatorial
108 | Extension (**CE**) algorithm to generate an alignment of the structure chain to itself,
109 | disabling the identity alignment (the diagonal of the **DotPlot** representation of a
110 | structure alignment). This allows the identification of alternative self-alignments,
111 | which are related to symmetry and/or structural repeats inside the chain.
112 | 
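The idea of a self-alignment with the identity diagonal disabled can be illustrated with a toy analogue in plain Java. This is not the CE algorithm, which compares structural fragments: here a string stands in for the chain, each non-zero offset corresponds to one off-diagonal of the self dot plot, and the class `SelfAlignmentSketch` and the example string are hypothetical. Offset 0 (the identity diagonal) always scores highest, so it is skipped and the best remaining offset points to the repeat.

```java
/** Illustration only: finds the best off-diagonal self-match of a sequence. */
final class SelfAlignmentSketch {

    /** Number of positions i where s[i] equals s[i + offset]: one diagonal of the self dot plot. */
    static int diagonalScore(String s, int offset) {
        int matches = 0;
        for (int i = 0; i + offset < s.length(); i++) {
            if (s.charAt(i) == s.charAt(i + offset)) matches++;
        }
        return matches;
    }

    /**
     * Best-scoring offset, skipping offset 0 (the identity diagonal).
     * Ties go to the smallest offset; returns -1 if s has fewer than two characters.
     */
    static int bestNonTrivialOffset(String s) {
        int best = -1;
        int bestScore = -1;
        for (int offset = 1; offset < s.length(); offset++) {
            int score = diagonalScore(s, offset);
            if (score > bestScore) {
                bestScore = score;
                best = offset;
            }
        }
        return best;
    }
}
```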
113 | In a procedure called **refinement**, the subunits of the chain that are part of the symmetry
114 | are defined and a **multiple alignment** is created. This process can be thought of as
115 | dividing the chain into subchains and then superimposing the subchains onto each other to
116 | create a multiple alignment of the subunits that respects the symmetry axes.
117 | 
118 | The **internal symmetry** detection algorithm is implemented in the BioJava class
119 | [CeSymm](http://www.biojava.org/docs/api/org/biojava/nbio/structure/symmetry/internal/CeSymm).
120 | It returns a `MultipleAlignment` object (see the explanation of the model in [Data Models](alignment-data-model.md))
121 | that describes the similarity of the internal repeats. If no symmetry is detected, the
122 | returned alignment represents the optimal self-alignment produced by the first step of the **CE-Symm**
123 | algorithm.
124 | 
125 | ```java
126 | //Input the atoms in a chain as an array
127 | Atom[] atoms = StructureTools.getRepresentativeAtomArray(chain);
128 | 
129 | //Initialize the algorithm
130 | CeSymm ceSymm = new CeSymm();
131 | 
132 | //Choose some parameters
133 | CESymmParameters params = ceSymm.getParameters();
134 | params.setRefineMethod(RefineMethod.SINGLE);
135 | params.setOptimization(true);
136 | params.setMultipleAxes(true);
137 | 
138 | //Run the symmetry analysis - alignment as an output
139 | MultipleAlignment symmetry = ceSymm.analyze(atoms, params);
140 | 
141 | //Test whether the returned alignment was refined
142 | boolean refined = SymmetryTools.isRefined(symmetry);
143 | 
144 | //Get the axes of symmetry from the aligner
145 | SymmetryAxes axes = ceSymm.getSymmetryAxes();
146 | 
147 | //Display the results in jmol with the SymmetryDisplay
148 | SymmetryDisplay.display(symmetry, axes);
149 | 
150 | //Show the point group, if any, of the internal symmetry
151 | QuatSymmetryResults pg = SymmetryTools.getQuaternarySymmetry(symmetry);
152 | System.out.println(pg.getSymmetry());
153 | 
154 | ```
155 | 
156 | To enable some extra features in the display, a `SymmetryDisplay`
157 | class has been created, although the `MultipleAlignmentDisplay` class
158 | can also be used for that purpose (it will not show symmetry axes or
159 | symmetry menus).
160 | 
161 | Lastly, the `SymmetryGUI` class in the **structure-gui** package
162 | provides a GUI to trigger internal symmetry analysis, equivalent
163 | to the GUI to trigger structure alignments.
164 | 
165 | ### Symmetry Display
166 | 
167 | The internal symmetry display is similar to the **quaternary symmetry** display, because
168 | part of the code is shared. See for example this beta-propeller (1U6D),
169 | where the repeated beta-sheets are connected by a linker forming a C6
170 | point group internal symmetry:
171 | 
172 | ![PDB ID 1U6D](img/symm_internal.png)
173 | 
174 | #### Hierarchical Symmetry
175 | 
176 | One additional feature of the **internal symmetry** display is the representation
177 | of hierarchical symmetries and repeats. Contrary to point groups, some structures
178 | have different **levels** of symmetry. That is, the whole structure has, e.g., C2
179 | symmetry and, at the same time, each of the two parts has C2 symmetry, but the axes
180 | of both levels are not related by a point group (i.e. they do not intersect at a single
181 | point).
182 | 
183 | The beta-gamma-crystallins, such as 4GCR, are a very clear example:
184 | 
185 | ![PDB ID 4GCR](img/symm_hierarchy.png)
186 | 
187 | #### Subunit Multiple Alignment
188 | 
189 | Another feature of the display is the option to show the **multiple alignment** of
190 | the symmetry-related subunits created during the **refinement** process. Search for
191 | the option *Subunit Superposition* in the *symmetry* menu of the Jmol window. For
192 | the previous example the display looks like this:
193 | 
194 | ![PDB ID 4GCR](img/symm_subunits.png)
195 | 
196 | The subunit display highlights the differences and similarities between the symmetry-related
197 | subunits of the chain and helps the user identify conserved and divergent
198 | regions, with the help of the *Sequence Alignment Panel*.
199 | 
200 | ## Quaternary + Internal Overall Symmetry
201 | 
202 | Finally, the internal and quaternary symmetries can be merged to obtain the
203 | overall combined symmetry. As we have seen before, the protein 1VYM is a DNA-clamp that
204 | has three chains arranged in a C3 symmetry. 
205 | Each chain is internally fourfold symmetric with two levels of symmetry. We can analyze the overall symmetry of the structure by considering together the C3 quaternary symmetry and the fourfold internal symmetry. 
206 | In this case, the internal symmetry **augments** the point group of the quaternary symmetry to a D6 overall symmetry, as we can see in the figure below:
207 | 
208 | ![PDB ID 1VYM](img/symm_combined.png)
209 | 
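As a quick consistency check on the 1VYM example, the number of symmetry-equivalent repeats should equal the order of the overall point group: three chains with four internal repeats each give twelve repeats, and D6 has order twelve. The sketch below encodes this arithmetic in plain Java; the class `PointGroupOrderSketch` is hypothetical and handles only cyclic (Cn) and dihedral (Dn) groups.

```java
/** Illustration only: order (number of operations) of a Cn or Dn point group. */
final class PointGroupOrderSketch {

    /** @param name Schoenflies-style name such as "C3" or "D6" */
    static int order(String name) {
        int n = Integer.parseInt(name.substring(1));
        switch (name.charAt(0)) {
            case 'C':
                return n;      // n rotations about a single axis
            case 'D':
                return 2 * n;  // n rotations plus n perpendicular twofold rotations
            default:
                throw new IllegalArgumentException("Only Cn and Dn are handled: " + name);
        }
    }
}
```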
210 | An example of how to toggle the **combined symmetry** (quaternary + internal symmetries) programmatically is shown below:
211 | 
212 | ```java
213 | // First load the structure in its biological assembly form (1VYM, discussed above)
214 | Structure s = StructureIO.getBiologicalAssembly("1vym");
215 | 
216 | // Initialize default parameters
217 | QuatSymmetryParameters parameters = new QuatSymmetryParameters();
218 | SubunitClustererParameters clusterParams = new SubunitClustererParameters();
219 | 
220 | // In SubunitClustererParameters set the clustering method to STRUCTURE and the internal symmetry option to true
221 | clusterParams.setClustererMethod(SubunitClustererMethod.STRUCTURE);
222 | clusterParams.setInternalSymmetry(true);
223 | 
224 | // You can lower the default structural coverage to improve the recall
225 | clusterParams.setStructureCoverageThreshold(0.75);
226 | 
227 | // The detector is not instantiated: the static methods of QuatSymmetryDetector
228 | // perform the calculation.
229 | // NOTE: method names may differ between BioJava versions; check the API docs
230 | // of the release you are using.
231 | QuatSymmetryResults overallResults = QuatSymmetryDetector.getGlobalSymmetry(s, parameters, clusterParams);
232 | 
233 | ```
234 | 
235 | See also the [test](https://github.com/biocryst/biojava/blob/df22da37a86a0dba3fb35bee7e17300d402ab469/biojava-integrationtest/src/test/java/org/biojava/nbio/structure/test/symmetry/TestQuatSymmetryDetectorExamples.java#L167-L192) provided in **BioJava** for a complete working example.
236 | 
237 | 
238 | ## Please Cite
239 | 
240 | **Analyzing the symmetrical arrangement of structural repeats in proteins with CE-Symm**<br/>
241 | *Spencer E Bliven, Aleix Lafita, Peter W Rose, Guido Capitani, Andreas Prlić, & Philip E Bourne* <br/>
242 | [PLOS Computational Biology (2019) 15 (4):e1006842.](https://journals.plos.org/ploscompbiol/article/citation?id=10.1371/journal.pcbi.1006842) <br/>
243 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006842-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006842) [![pubmed](https://img.shields.io/badge/pubmed-31009453-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/31009453)
244 | 
245 | 
246 | 
247 | <!--automatically generated footer-->
248 | 
249 | ---
250 | 
251 | Navigation:
252 | [Home](../README.md)
253 | | [Book 3: The Structure Modules](README.md)
254 | | Chapter 14 : Protein Symmetry
255 | 
256 | Prev: [Chapter 13 - Finding all Interfaces in Crystal: Crystal Contacts](crystal-contacts.md)
257 | 
258 | Next: [Chapter 15 : Protein Secondary Structure](secstruc.md)
259 | 


--------------------------------------------------------------------------------