├── .gitignore ├── README.md ├── alignment ├── README.md ├── img │ └── alignment.png ├── installation.md └── smithwaterman.md ├── bin └── update_index.py ├── core ├── README.md ├── img │ └── core.png ├── installation.md ├── readwrite.md ├── sequences.md └── translating.md ├── genomics ├── README.md ├── chromosomeposition.md ├── genebank.md ├── genenames.md ├── gff.md ├── img │ └── genomics.png ├── installation.md ├── karyotype.md └── twobit.md ├── installation.md ├── license.md ├── logo.png ├── modfinder ├── README.md ├── add-protein-modification.md ├── identify-protein-modifications.md ├── installation.md └── supported-protein-modifications.md ├── protein-disorder └── README.md └── structure ├── README.md ├── alignment-data-model.md ├── alignment.md ├── asa.md ├── bioassembly.md ├── caching.md ├── chemcomp.md ├── contact-map.md ├── crystal-contacts.md ├── externaldb.md ├── firststeps.md ├── img ├── 143px-Selenomethionine-from-xtal-3D-balls.png ├── 1cfd_1cll_fatcat.png ├── 1cfd_1cll_fatcat.xcf ├── 1cfd_1cll_flexible.png ├── 1cfd_1cll_rigid.png ├── 1dan_scop.png ├── 1gav_asym.png ├── 1gav_biounit.png ├── 1hho_asym.png ├── 1hho_biounit.png ├── 1m4x_bio_r_250.jpg ├── 2hyn_1zll.png ├── 3cna.A_2pel.A_cecp.png ├── 4hhb_bio_r_250.jpg ├── 4hhb_jmol.png ├── alignment_gui.png ├── alignmentpanel.png ├── cath_1dan.png ├── database_search.png ├── database_search_results.png ├── multiple_gui.png ├── multiple_jmol_globins.png ├── multiple_panel_globins.png ├── symm_combined.png ├── symm_helical.png ├── symm_hierarchy.png ├── symm_internal.png ├── symm_local.png ├── symm_pg.png ├── symm_pseudo.png └── symm_subunits.png ├── installation.md ├── lists.md ├── mmcif.md ├── secstruc.md ├── seqres.md ├── special.md ├── structure-data-model.md └── symmetry.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | .profile 3 | .settings 4 | .idea -------------------------------------------------------------------------------- 
/README.md: -------------------------------------------------------------------------------- 1 | Tutorial 2 | === 3 | 4 | A brief introduction to [BioJava](https://www.biojava.org). 5 | ----- 6 | 7 | The goal of this tutorial is to provide an educational introduction to some of the features provided by BioJava. This tutorial is still under development and does not yet cover the entire library. Please also check other sources of [documentation](https://biojava.org/wiki/Documentation). 8 | 9 | The examples within the tutorial are intended to work with the most recent version of BioJava. Please submit a [new issue](https://github.com/biojava/biojava-tutorial/issues) if you find any problems. 10 | 11 | The tutorial is subdivided into several books, corresponding to the respective BioJava modules. Each book is further subdivided into several chapters that describe the main functionality of the module in order of increasing complexity. 12 | 13 | ## Index 14 | 15 | [Quick Installation](installation.md) 16 | 17 | Book 1: [The Core Module](core/README.md), basic work with sequences. 18 | 19 | Book 2: [The Alignment Module](alignment/README.md), pairwise and multiple alignments of protein sequences. 20 | 21 | Book 3: [The Structure Modules](structure/README.md), everything related to working with 3D structures. 22 | 23 | Book 4: [The Genomics Module](genomics/README.md), working with genomic data. 24 | 25 | Book 5: [The Protein-Disorder Module](protein-disorder/README.md), predicting protein disorder. 26 | 27 | Book 6: [The ModFinder Module](modfinder/README.md), identifying protein modifications in 3D structures. 28 | 29 | ## License 30 | 31 | The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md). 32 | 33 | ## Please Cite 34 | 35 | **BioJava 5: A community driven open-source bioinformatics library**
36 | *Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
37 | [PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
38 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) 39 | 40 | 41 | 42 | -------------------------------------------------------------------------------- /alignment/README.md: -------------------------------------------------------------------------------- 1 | The BioJava - Alignment Module 2 | ===================================================== 3 | 4 | A tutorial for the alignment module of [BioJava](http://www.biojava.org). 5 | 6 | ## About 7 | 8 | 9 | 12 | 20 | 21 |
The alignment module of BioJava provides an API that contains:

- Implementations of dynamic programming algorithms for sequence alignment
- Reading and writing of popular alignment file formats
- A single- or multi-threaded multiple sequence alignment algorithm
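To illustrate the first item in the list above, here is a minimal sketch of the dynamic programming recurrence behind local (Smith-Waterman) alignment. It is plain Java and does not use the BioJava API: the class name `LocalAlignmentSketch`, the method `localScore`, and the simple match/mismatch scoring with a linear gap cost are assumptions made for illustration only. The BioJava aligners shown later in this book work with substitution matrices and separate gap open and extension penalties instead.

```java
public class LocalAlignmentSketch {

    /**
     * Returns the best local alignment score of a and b.
     * Hypothetical helper: scores a match or mismatch per position
     * and charges the same (negative) cost for every gap position.
     */
    public static int localScore(String a, String b, int match, int mismatch, int gap) {
        // h[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diag = h[i - 1][j - 1] + sub; // align the two residues
                int up = h[i - 1][j] + gap;       // gap in b
                int left = h[i][j - 1] + gap;     // gap in a
                // A local alignment may restart anywhere, so scores never drop below 0
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(localScore("ACGT", "ACT", 2, -1, -1));
    }
}
```

The sketch only computes the score. Recovering the aligned residues requires a traceback from the best-scoring cell, which is omitted here to keep the recurrence visible.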
22 | 23 | ## Index 24 | 25 | This tutorial is split into several chapters. 26 | 27 | Chapter 1 - Quick [Installation](installation.md) 28 | 29 | Chapter 2 - Global alignment - Needleman and Wunsch algorithm 30 | 31 | Chapter 3 - [Local alignment](smithwaterman.md) - Smith-Waterman algorithm 32 | 33 | Chapter 4 - Multiple Sequence alignment 34 | 35 | Chapter 5 - Reading and writing of multiple alignments 36 | 37 | Chapter 6 - BLAST - why you don't need BioJava for parsing BLAST 38 | 39 | ## License 40 | 41 | The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). 42 | 43 | ## Please cite 44 | 45 | **BioJava 5: A community driven open-source bioinformatics library**
46 | *Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
47 | [PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
48 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) 49 | 50 | 51 | 52 | 53 | 54 | --- 55 | 56 | Navigation: 57 | [Home](../README.md) 58 | | Book 2: The Alignment Module 59 | 60 | Prev: [Book 1: The Core Module](../core/README.md) 61 | 62 | Next: [Book 3: The Structure Modules](../structure/README.md) 63 | -------------------------------------------------------------------------------- /alignment/img/alignment.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/alignment/img/alignment.png -------------------------------------------------------------------------------- /alignment/installation.md: -------------------------------------------------------------------------------- 1 | ## Quick Installation 2 | 3 | First, a quick paragraph on how to get access to BioJava. 4 | 5 | BioJava is open source and you can get the code from [GitHub](https://github.com/biojava/biojava); however, it might be easier this way: 6 | 7 | BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html) guide. 8 | 9 | As of version 4, BioJava is available in Maven Central. This is all you need to add a BioJava dependency to your project: 10 | 11 | ```xml 12 | <dependencies> 13 | ... 14 | 15 | <!-- This imports the latest version of BioJava alignment module --> 16 | <dependency> 17 | 18 | <groupId>org.biojava</groupId> 19 | <artifactId>biojava-alignment</artifactId> 20 | <version>4.0.0</version> 21 | </dependency> 22 | 23 | 24 | <!-- other biojava jars as needed --> 25 | 26 | </dependencies> 27 | ``` 28 | 29 | If you run 30 | 31 |
32 |     mvn package
33 | 
34 | 35 | on your project, the BioJava dependencies will be automatically downloaded and installed for you. 36 | 37 | 38 | 39 | 40 | --- 41 | 42 | Navigation: 43 | [Home](../README.md) 44 | | [Book 2: The Alignment Module](README.md) 45 | | Chapter 1 : Installation 46 | -------------------------------------------------------------------------------- /alignment/smithwaterman.md: -------------------------------------------------------------------------------- 1 | Smith Waterman - Local Alignment 2 | ================================ 3 | 4 | BioJava contains implementations of various protein sequence and 3D structure alignment algorithms. Here is how to run a local (Smith-Waterman) alignment of two protein sequences: 5 | 6 | 7 | 8 | ```java 9 | public static void main(String[] args) throws Exception { 10 | 11 | String uniprotID1 = "P69905"; 12 | String uniprotID2 = "P68871"; 13 | 14 | ProteinSequence s1 = getSequenceForId(uniprotID1); 15 | ProteinSequence s2 = getSequenceForId(uniprotID2); 16 | 17 | SubstitutionMatrix<AminoAcidCompound> matrix = SubstitutionMatrixHelper.getBlosum65(); 18 | 19 | GapPenalty penalty = new SimpleGapPenalty(); 20 | 21 | int gop = 8; 22 | int extend = 1; 23 | penalty.setOpenPenalty(gop); 24 | penalty.setExtensionPenalty(extend); 25 | 26 | 27 | PairwiseSequenceAligner<ProteinSequence, AminoAcidCompound> smithWaterman = 28 | Alignments.getPairwiseAligner(s1, s2, PairwiseSequenceAlignerType.LOCAL, penalty, matrix); 29 | 30 | SequencePair<ProteinSequence, AminoAcidCompound> pair = smithWaterman.getPair(); 31 | 32 | 33 | System.out.println(pair.toString(60)); 34 | 35 | 36 | } 37 | 38 | private static ProteinSequence getSequenceForId(String uniProtId) throws Exception { 39 | URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId)); 40 | ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId); 41 | System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader()); 42 | System.out.println(); 43 | 44
| return seq; 45 | } 46 | ``` 47 | -------------------------------------------------------------------------------- /bin/update_index.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """ 3 | This script generates the footers for all markdown files. Rerun the script 4 | after adding new books or chapters in order to update the footer sections on 5 | each page with links to the next and previous chapters. 6 | 7 | The script works by recursively parsing "## Index" sections in files, starting 8 | with README.md. The footer is marked with an HTML comment, `automatically 9 | generated footer`. Any text after this comment is destroyed by the script, so 10 | all edits should be made above that point. 11 | 12 | """ 13 | 14 | import sys,os,re 15 | 16 | class TutorialIndex(object): 17 | 18 | footermark = u"<!--automatically generated footer-->" 19 | 20 | def __init__(self,link,chapter=None,title=None,parent=None): 21 | """Create a new TutorialIndex 22 | 23 | :param link: A link to this page, relative to parent's link 24 | :param chapter: The chapter number, e.g. "Chapter 5" 25 | :param title: The chapter title, e.g. "Writing Docstrings" 26 | :param parent: The TutorialIndex which references this one 27 | """ 28 | self.link = link 29 | self.chapter = chapter 30 | self.title = title 31 | self.parent = parent 32 | self.children = [] 33 | 34 | def parse(self): 35 | """Parse the index, add a footer, and do the same for each child 36 | found in the '# Index' section (if any). 37 | 38 | Caution, this method overwrites any existing footer sections on this 39 | file and all children!
40 | """ 41 | 42 | # Recognise and parse "<CHAPTER>: [<TITLE>](<LINK>)" 43 | indexentry = re.compile(r"^(.*)[:-].*\[([^]]*)\]\(([^)]*)\).*$") 44 | 45 | filename = self.rootlink() 46 | 47 | with open(filename,"r+") as file: 48 | line = file.readline() 49 | had_footer=False 50 | 51 | # Parse file for index, truncate prior footer, and append footermark 52 | in_index = False 53 | while line: 54 | if line[0] == u"#": #That's a header, not a comment 55 | if u"index" in line.lower(): 56 | in_index = True 57 | else: 58 | in_index = False 59 | elif line.strip() == TutorialIndex.footermark: # Footer already! 60 | had_footer=True 61 | file.truncate() 62 | break 63 | elif in_index: 64 | # look for 'Chapter 1: [Title](link)' 65 | result = indexentry.match(line) 66 | if result: 67 | chapter,title,link = result.groups() 68 | child = TutorialIndex(link,chapter,title,self) 69 | self.children.append(child) 70 | 71 | line = file.readline() 72 | 73 | # Append footer 74 | if not had_footer: 75 | file.write(u"\n") 76 | file.write(TutorialIndex.footermark) 77 | file.write(u"\n") 78 | footer = self.makefooter() 79 | file.write(footer) 80 | 81 | # Recurse to children 82 | for child in self.children: 83 | child.parse() 84 | 85 | def rootlink(self): 86 | """Convert self.link to an absolute path relative to the root TutorialIndex 87 | :return: The path to this TutorialIndex relative to the root index 88 | """ 89 | if self.parent is None: 90 | return self.link 91 | parentlink = self.parent.rootlink() 92 | 93 | return os.path.join(os.path.dirname(parentlink),self.link) 94 | 95 | def makefooter(self): 96 | """ makefooter() -> str 97 | 98 | Creates the footer text (everything below the "automatically generated 99 | footer" line) 100 | """ 101 | # Don't include footer on main page 102 | if self.parent is None: 103 | return "" 104 | 105 | lines = ["","---","","Navigation:"] 106 | # Iterate over parents 107 | p = self.parent 108 | linkmd = [self.makename()] #reverse order (self to root) 109 | while p is not None: 110
| name = p.makename() 111 | # Get a path to p relative to our own path 112 | link = os.path.relpath(p.rootlink(),os.path.dirname(self.rootlink())) 113 | linkmd.append("[{0}]({1})".format(name,link)) 114 | p = p.parent 115 | linkmd.reverse() 116 | lines.append("\n| ".join(linkmd)) 117 | 118 | lines.append("") 119 | 120 | if self.parent is not None: 121 | pos = self.parent.children.index(self) #Should always work 122 | if pos > 0: 123 | prev = self.parent.children[pos-1] 124 | name = prev.makename() 125 | link = os.path.relpath(prev.rootlink(),os.path.dirname(self.rootlink())) 126 | lines.append("Prev: [{0}]({1})".format(name,link)) 127 | lines.append("") 128 | if pos < len(self.parent.children)-1: 129 | next = self.parent.children[pos+1] 130 | name = next.makename() 131 | link = os.path.relpath(next.rootlink(),os.path.dirname(self.rootlink())) 132 | lines.append("Next: [{0}]({1})".format(name,link)) 133 | lines.append("") 134 | 135 | #lines.append(self.makename()+", "+self.link) 136 | return "\n".join(lines) 137 | 138 | def makename(self): 139 | """ Return a name, like "<CHAPTER>: <TITLE>" 140 | """ 141 | if self.chapter: 142 | name = self.chapter 143 | if self.title: 144 | name += ": " + self.title 145 | elif self.title: 146 | name = self.title 147 | else: 148 | name = self.link #last resort 149 | 150 | return name 151 | 152 | def __repr__(self): 153 | return "TutorialIndex({self.link!r},{self.chapter!r},{self.title!r},{parent!r})" \ 154 | .format(self=self,parent=self.parent.title if self.parent else None) 155 | 156 | if __name__ == "__main__": 157 | # Set root index 158 | root = TutorialIndex("README.md",title="Home") 159 | 160 | # Rewrite footers 161 | root.parse() 162 | 163 | # Output tree (print() call form works on both Python 2 and 3) 164 | def pr(node,indent=""): 165 | print("{0}{1}".format(indent,node.link)) 166 | for n in node.children: 167 | pr(n,indent+" ") 168 | 169 | pr(root) 170 | -------------------------------------------------------------------------------- /core/README.md:
-------------------------------------------------------------------------------- 1 | The BioJava - Core Module 2 | ===================================================== 3 | 4 | A tutorial for the core module of [BioJava](http://www.biojava.org). 5 | 6 | ## About 7 | <table> 8 | <tr> 9 | <td> 10 | <img src="img/core.png"/> 11 | </td> 12 | <td> 13 | The <i>core</i> module of BioJava provides an API for 14 | <ul> 15 | <li>Basic operations with biological sequences</li> 16 | <li>Reading and writing of popular sequence file formats</li> 17 | <li>Translating DNA sequences into protein sequences</li> 18 | </ul> 19 | </td> 20 | </tr> 21 | </table> 22 | 23 | ## Index 24 | 25 | This tutorial is split into several chapters. 26 | 27 | Chapter 1 - Quick [Installation](installation.md) 28 | 29 | Chapter 2 - [Basic Sequence types](sequences.md) 30 | 31 | Chapter 3 - [Reading and Writing sequences](readwrite.md) 32 | 33 | Chapter 4 - [Translating](translating.md) DNA and protein sequences. 34 | 35 | ## License 36 | 37 | The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). 38 | 39 | ## Please Cite 40 | 41 | **BioJava 5: A community driven open-source bioinformatics library**<br/> 42 | *Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M.
Duarte* <br/> 43 | [PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791) <br/> 44 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) 45 | 46 | 47 | 48 | <!--automatically generated footer--> 49 | 50 | --- 51 | 52 | Navigation: 53 | [Home](../README.md) 54 | | Book 1: The Core Module 55 | 56 | Next: [Book 2: The Alignment Module](../alignment/README.md) 57 | -------------------------------------------------------------------------------- /core/img/core.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/core/img/core.png -------------------------------------------------------------------------------- /core/installation.md: -------------------------------------------------------------------------------- 1 | ## Quick Installation 2 | 3 | In the beginning, just one quick paragraph of how to get access to BioJava. 4 | 5 | BioJava is open source and you can get the code from [Github](https://github.com/biojava/biojava), however it might be easier this way: 6 | 7 | BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html) guide. 8 | 9 | 10 | As of version 4, BioJava is available in maven central. This is all you would need to add a BioJava dependency to your projects: 11 | 12 | ```xml 13 | <dependencies> 14 | ... 
15 | 16 | <!-- This imports the latest version of BioJava core module --> 17 | <dependency> 18 | 19 | <groupId>org.biojava</groupId> 20 | <artifactId>biojava-core</artifactId> 21 | <version>4.0.0</version> 22 | </dependency> 23 | 24 | 25 | <!-- other biojava jars as needed --> 26 | 27 | </dependencies> 28 | ``` 29 | 30 | If you run 31 | 32 | <pre> 33 | mvn package 34 | </pre> 35 | 36 | on your project, the BioJava dependencies will be automatically downloaded and installed for you. 37 | 38 | 39 | <!--automatically generated footer--> 40 | 41 | --- 42 | 43 | Navigation: 44 | [Home](../README.md) 45 | | [Book 1: The Core Module](README.md) 46 | | Chapter 1 : Installation 47 | 48 | Next: [Chapter 2 : Basic Sequence types](sequences.md) 49 | -------------------------------------------------------------------------------- /core/readwrite.md: -------------------------------------------------------------------------------- 1 | Reading and Writing of Basic sequence file formats 2 | ================================================== 3 | 4 | 5 | TODO: needs more examples 6 | 7 | 8 | ## FASTA 9 | 10 | A quick way to parse a FASTA file is to use the `FastaReaderHelper` class. 11 | 12 | Here is an example that parses a UniProt FASTA file into a protein sequence. 13 | 14 | ```java 15 | public static ProteinSequence getSequenceForId(String uniProtId) throws Exception { 16 | URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId)); 17 | ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId); 18 | System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader()); 19 | System.out.println(); 20 | 21 | return seq; 22 | } 23 | ``` 24 | 25 | 26 | BioJava can also be used to parse large FASTA files. The example below can parse a 1GB (compressed) version of TREMBL with standard memory settings.
27 | 28 | 29 | ```java 30 | 31 | 32 | 33 | /** Download a large file, e.g. ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz 34 | * and pass in path to local location of file 35 | * 36 | * @param args 37 | */ 38 | public static void main(String[] args) { 39 | 40 | if ( args.length < 1) { 41 | System.err.println("First argument needs to be path to fasta file"); 42 | return; 43 | } 44 | 45 | File f = new File(args[0]); 46 | 47 | if ( ! f.exists()) { 48 | System.err.println("File does not exist " + args[0]); 49 | return; 50 | } 51 | 52 | try { 53 | 54 | // automatically uncompresses files using InputStreamProvider 55 | InputStreamProvider isp = new InputStreamProvider(); 56 | 57 | InputStream inStream = isp.getInputStream(f); 58 | 59 | FastaReader<ProteinSequence, AminoAcidCompound> fastaReader = new FastaReader<ProteinSequence, AminoAcidCompound>( 60 | inStream, 61 | new GenericFastaHeaderParser<ProteinSequence, AminoAcidCompound>(), 62 | new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet())); 63 | 64 | LinkedHashMap<String, ProteinSequence> b; 65 | 66 | 67 | int nrSeq = 0; 68 | 69 | while ((b = fastaReader.process(10)) != null) { 70 | for (String key : b.keySet()) { 71 | nrSeq++; 72 | System.out.println(nrSeq + " : " + key + " " + b.get(key)); 73 | } 74 | 75 | } 76 | } catch (Exception ex) { 77 | Logger.getLogger(ParseFastaFileDemo.class.getName()).log(Level.SEVERE, null, ex); 78 | } 79 | } 80 | ``` 81 | 82 | BioJava can also process large FASTA files using the Java streams API. 
83 | 84 | ```java 85 | FastaStreamer 86 | .from(path) 87 | .stream() 88 | .forEach(sequence -> System.out.printf("%s -> %s%n", sequence.getOriginalHeader(), sequence.getSequenceAsString())); 89 | ``` 90 | 91 | If you need to specify a header parser other than `GenericFastaHeaderParser` or a sequence creator other than a 92 | `ProteinSequenceCreator`, these can be specified before streaming the contents as follows: 93 | 94 | ```java 95 | FastaStreamer 96 | .from(path) 97 | .withHeaderParser(new PlainFastaHeaderParser<>()) 98 | .withSequenceCreator(new CasePreservingProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet())) 99 | .stream() 100 | .forEach(sequence -> System.out.printf("%s -> %s%n", sequence.getOriginalHeader(), sequence.getSequenceAsString())); 101 | ``` 102 | 103 | 104 | 105 | <!--automatically generated footer--> 106 | 107 | --- 108 | 109 | Navigation: 110 | [Home](../README.md) 111 | | [Book 1: The Core Module](README.md) 112 | | Chapter 3 : Reading and Writing sequences 113 | 114 | Prev: [Chapter 2 : Basic Sequence types](sequences.md) 115 | 116 | Next: [Chapter 4 : Translating](translating.md) 117 | -------------------------------------------------------------------------------- /core/sequences.md: -------------------------------------------------------------------------------- 1 | Sequences in BioJava 2 | ===================== 3 | 4 | BioJava supports a number of basic biological sequence types: DNA, RNA, and protein sequences.
5 | 6 | ## Create a basic sequence object 7 | 8 | Create a DNA sequence 9 | 10 | ```java 11 | DNASequence seq = new DNASequence("GTAC"); 12 | ``` 13 | 14 | In addition to the basic DNA sequence class, there are specialized classes that extend DNASequence: 15 | ChromosomeSequence, GeneSequence, IntronSequence, ExonSequence, TranscriptSequence 16 | 17 | Create an RNA sequence 18 | 19 | ```java 20 | RNASequence seq = new RNASequence("GUAC"); 21 | ``` 22 | 23 | Create a protein sequence 24 | 25 | ```java 26 | ProteinSequence seq = new ProteinSequence("MSTNPKPQRKTKRNTNRRPQDVKFPGG"); 27 | ``` 28 | 29 | ## Ambiguity codes 30 | 31 | Particularly when dealing with nucleotide sequences, the exact nucleotides are sometimes not known. 32 | BioJava supports standard conventions for dealing with such ambiguity. 33 | For example, "W" is often used to represent the nucleotides "A or T". 34 | By default, the expected set of compounds in a sequence is strict; however, it takes only one line of code to switch to supporting 35 | ambiguity codes. 36 | 37 | 38 | ```java 39 | // this throws an error 40 | DNASequence dna1 = new DNASequence("WWW"); 41 | 42 | // however this works: 43 | AmbiguityDNACompoundSet ambiguityDNACompoundSet = AmbiguityDNACompoundSet.getDNACompoundSet(); 44 | DNASequence dna2 = new DNASequence("WWW",ambiguityDNACompoundSet); 45 | ``` 46 | 47 | 48 | ## Protein sequences and ambiguity 49 | The default AminoAcidCompoundSet already supports "Asparagine or Aspartic acid" and related ambiguities.
50 | It also contains support for Selenocysteine and Pyrrolysine 51 | 52 | 53 | 54 | ## More details 55 | 56 | See the Cookbook for [more details on dealing with sequences] (http://biojava.org/wiki/BioJava:CookBook:Core:Overview) 57 | <!--automatically generated footer--> 58 | 59 | --- 60 | 61 | Navigation: 62 | [Home](../README.md) 63 | | [Book 1: The Core Module](README.md) 64 | | Chapter 2 : Basic Sequence types 65 | 66 | Prev: [Chapter 1 : Installation](installation.md) 67 | 68 | Next: [Chapter 3 : Reading and Writing sequences](readwrite.md) 69 | -------------------------------------------------------------------------------- /core/translating.md: -------------------------------------------------------------------------------- 1 | Translating RNA and protein sequences 2 | ===================================== 3 | 4 | 5 | An example for how to parse a sequence from a String and using the Translation engine to convert into amino acid sequence. 6 | 7 | ```java 8 | String dnaFastaS = ">gb:GQ903697|Organism:Arenavirus H0030026 H0030026|Segment:S|Host:Rat\n" + 9 | "CGCACAGAGGATCCTAGGCGTTACTGACTTGCGCTAATAACAGATACTGTTTCATATTTAGATAAAGACC\n" + 10 | "CAGCCAACTGATTGGTCAGCATGGGACAACTTGTGTCCCTCTTCAGTGAAATTCCATCAATCATACACGA\n" + 11 | "AGCTCTCAATGTTGCTCTCGTAGCTGTTAGCATCATTGCAATATTGAAAGGGGTTGTGAATGTTTGGAAG\n" + 12 | "AGTGGAGTTTTGCAGCTTTTGGCCTTCTTGCTCCTGGCGGGAAGATCCTGCTCAGTCATAATTGGTCATC\n" + 13 | "ATCTCGAACTGCAGCATGTGATCTTCAATGGGTCATCAATCACACCCTTTTTACCAGTTACATGTAAGAT\n" + 14 | "CAATGATACCTACTTCCTACTAAGAGGCCCCTATGAAGCTGATTGGGCAGTTGAATTGAGTGTAACTGAA\n" + 15 | "ACCACAGTCTTGGTTGATCTTGAAGGTGGCAGCTCAATGAAGCTGAAAGCCGGAAACATCTCAGGTTGTC\n" + 16 | "TTGGAGACAACCCCCATCTGAGATCAGTGGTCTTCACATTGAATTGGTTGCTAACAGGATTAGATCATGT\n" + 17 | "TATTGATTCTGACCCGAAAATTCTCTGTGATCTTAAAGACAGTGGGCACTTTCGTCTCCAGATGAACTTA\n" + 18 | "ACAGAAAAGCACTATTGTGACAAGTTTCACATCAAAATGGGCAAGGTCTTTGGCGTATTCAAAGATCCGT\n" + 19 | "GCATGGCTGGTGGTAAAATGTTTGCCATACTAAAAAATACCTCTTGGTCGAACCAGTGCCAAGGAAACCA\n" + 20 | 
"TGTCAGCACCATTCATCTTGTCCTTCAGAGTAATTTCAAACAGGTCCTCAGTAGCAGGAAACTGTTGAAC\n" + 21 | "TTTTTCAGCTGGTCATTGTCTGATGCCACAGGGGCTGATATGCCTGGTGGTTTTTGTCTGGAAAAATGGA\n" + 22 | "TGTTGATTTCAAGTGAACTGAAATGCTTTGGAAACACAGCTGTGGCAAAGTGCAACTTAAATCATGACTC\n" + 23 | "AGAGTTCTGTGACATGCTTAGGCTTTTTGATTTCAACAAAAAGGCAATAGTCACTCTTCAGAACAAAACA\n" + 24 | "AAGCATCGGCTGGACACAGTAATTACTGCTATCAATTCATTGATCTCTGATAATATTCTTATGAAGAACA\n" + 25 | "GGATTAAAGAATTGATAGATGTTCCTTACTGTAATTACACCAAATTTTGGTATGTCAATCACACAGGTCT\n" + 26 | "AAATCTGCACACCCTTCCAAGATGTTGGCTTGTTAAAAATGGTAGCTACTTGAATGTGTCTGACTTCAGG\n" + 27 | "AATGAGTGGATATTGGAGAGTGATCATCTTGTTTCGGAGATCCTTTCAAAGGAGTATGAGGAAAGGCAAA\n" + 28 | "ATCGTACACCACTCTCACTGGTTGACATCTGTTTCTGGAGTACATTGTTTTACACAGCATCAATTTTCCT\n" + 29 | "ACACCTCTTGAGAATTCCAACCCACAGACACATTGTTGGTGAGGGCTGCCCGAAGCCTCATAGGCTAAAC\n" + 30 | "AGGCACTCAATATGTGCTTGTGGCCTTTTCAAACAAGAAGGCAGACCCTTGAGATGGGTAAGAAAGGTGT\n" + 31 | "GAACAATGGTTGCTTGGTGGCCTCCATTGCTGCACCCCCCTAGGGGGGTGCAGCAATGGAGGTTCTCGYT\n" + 32 | "GAGCCTAGAGAACAACTGTTGAATCGGGTTCTCTAAAGAGAACATCGATTGGTAGTACCCTTTTTGGTTT\n" + 33 | "TTCATTGGTCACTGACCCTGAAAGCACAGCACTGAACATCAAACAGTCCAAAAGTGCACAGTGTGCATTT\n" + 34 | "GTTGTGGCTGGTGCTGATCCTTTCTTCTTACTTTTAATGACTATTCCCTTATGTCTGTCACACAGATGTT\n" + 35 | "CAAATCTCTTCCAAACAAGATCTTCAAAGAGCCGTGACTGTTCTGCGGTCAGTTTGACATCAACAATCTT\n" + 36 | "CAAATCCTGTCTTCCATGCATATCAAAGAGCCTCCTAATATCATCAGCACCTTGCGCAGTGAAAACCATG\n" + 37 | "GATTTAGGCAGACTCCTTATTATGCTTGTGATGAGGCCAGGTCGTGCATGTTCAACATCCTTCAGCAATA\n" + 38 | "TCCCATGACAATATTTACTTTGGTCCTTAAAAGATTTTATGTCATTGGGTTTTCTGTAGCAGTGGATGAA\n" + 39 | "TTTTTGTGATTCAGGCTGGTAAATTGCAAACTCAACAGGGTCATGTGGCGGGCCTTCAATGTCAATCCAT\n" + 40 | "GTTGTGTCACTGACCATCAACGACTCTACACTTCTCTTCACCTGAGCCTCCACCTCAGGCTTGAGCGTGG\n" + 41 | "ACAAGAGTGGGGCACCACCGTTCCGGATGGGGACTGGTGTTTTGCTTGGTAAACTCTCAAATTCCACAAC\n" + 42 | "TGTATTGTCCCATGCTCTCCCTTTGATCTGTGATCTTGATGAAATGTAAGGCCAGCCCTCACCAGAGAGA\n" + 43 | "CACACCTTATAAAGTATGTTTTCATAAGGATTCCTCTGTCCTGGTATGGCACTGATGAACATGTTTTCCC\n" + 44 | 
"TCTTTTTGATCTCCAAGAGGGTTTTTATAATGGTTGTGAATGTGGACTCCTCAATCTTTATTGTTTCCAG\n" + 45 | "CATGTTGCCACCATCAATCAGGCAAGCACCGGCTTTCACAGCAGCTGATAAACTAAGGTTGTAGCCTGAT\n" + 46 | "ATGTTAATTTGAGAATCCTCCTGAGTGATTACCTTTAGAGAAGGATGCTTCTCCATCAAAGCATCTAAGT\n" + 47 | "CACTTAAATTAGGGTATTTTGCTGTGTATAGCAACCCCAGATCTGTGAGGGCCTGAACCACATCATTTAG\n" + 48 | "AGTTTCCCCTCCCTGTTCAGTCATACAGGAAATTGTGAGTGCTGGCATCGATCCAAATTGGTTGATCATA\n" + 49 | "AGTGATGAGTCTTTAACGTCCCAGACTTTGACCACCCCTCCAGTTCTAGCCAACCCAGGTCTCTGAATAC\n" + 50 | "CAACAAGTTGCAGAATTTCGGACCTCCTGGTGAGCTGTGTTGTAGAGAGGTTCCCTAGATACTGGCCACC\n" + 51 | "TGTGGCTGTCAACCTCTCTGTTCTTTGAACTTTTTGCCTTAATTTGTCCAAGTCACTGGAGAGTTCCATT\n" + 52 | "AGCTCTTCCTTTGACAATGATCCTATCTTAAGGAACATGTTCTTTTGGGTTGACTTCATGACCATCAATG\n" + 53 | "AGTCAACTTCCTTATTCAAGTCCCTCAAACTAACAAGATCACTGTCATCTCTTTTAGACCTCCTCATCAT\n" + 54 | "GCGTTGCACACTTGCAACCTTTGAAAAATCTAAGCCGGACAGAAGAGCCCTCGCGTCAGTTAGGACATCT\n" + 55 | "GCCTTAACAGCAGTTGTCCAGTTCGAGAGTCCTCTCCTGAGAGACTGTGTCCATCTGAATGATGGGATTG\n" + 56 | "GTTGTTCGCTCATAGTGATGAAATTGCGCAGAGTTATCCAAAAGCCTAGGATCCTCTGTGCG"; 57 | 58 | 59 | try { 60 | 61 | // parse the raw sequence from the string 62 | InputStream stream = new ByteArrayInputStream(dnaFastaS.getBytes()); 63 | 64 | // define the Ambiguity Compound Sets 65 | AmbiguityDNACompoundSet ambiguityDNACompoundSet = AmbiguityDNACompoundSet.getDNACompoundSet(); 66 | CompoundSet<NucleotideCompound> nucleotideCompoundSet = AmbiguityRNACompoundSet.getRNACompoundSet(); 67 | 68 | FastaReader<DNASequence, NucleotideCompound> proxy = 69 | new FastaReader<DNASequence, NucleotideCompound>( 70 | stream, 71 | new GenericFastaHeaderParser<DNASequence, NucleotideCompound>(), 72 | new DNASequenceCreator(ambiguityDNACompoundSet)); 73 | 74 | // has only one entry in this example, but could be easily extended to parse a FASTA file with multiple sequences 75 | LinkedHashMap<String, DNASequence> dnaSequences = proxy.process(); 76 | 77 | // Initialize the Transcription Engine 78 | TranscriptionEngine engine = new 79 | 
TranscriptionEngine.Builder().dnaCompounds(ambiguityDNACompoundSet).rnaCompounds(nucleotideCompoundSet).build(); 80 | 81 | Frame[] sixFrames = Frame.getAllFrames(); 82 | 83 | for (DNASequence dna : dnaSequences.values()) { 84 | 85 | Map<Frame, Sequence<AminoAcidCompound>> results = engine.multipleFrameTranslation(dna, sixFrames); 86 | 87 | for (Frame frame : sixFrames){ 88 | System.out.println("Translated Frame:" + frame +" : " + results.get(frame)); 89 | } 90 | 91 | } 92 | } catch (Exception e){ 93 | e.printStackTrace(); 94 | } 95 | ``` 96 | 97 | This code will print out: 98 | 99 | ``` 100 | Translated Frame:ONE : RTEDPRRY*LALITDTVSYLDKDPAN*LVSMGQLVSLFSEIPSIIHEALNVALVAVSIIAILKGVVNVWKSGVLQLLAFLLLAGRSCSVIIGHHLELQHVIFNGSSITPFLPVTCKINDTYFLLRGPYEADWAVELSVTETTVLVDLEGGSSMKLKAGNISGCLGDNPHLRSVVFTLNWLLTGLDHVIDSDPKILCDLKDSGHFRLQMNLTEKHYCDKFHIKMGKVFGVFKDPCMAGGKMFAILKNTSWSNQCQGNHVSTIHLVLQSNFKQVLSSRKLLNFFSWSLSDATGADMPGGFCLEKWMLISSELKCFGNTAVAKCNLNHDSEFCDMLRLFDFNKKAIVTLQNKTKHRLDTVITAINSLISDNILMKNRIKELIDVPYCNYTKFWYVNHTGLNLHTLPRCWLVKNGSYLNVSDFRNEWILESDHLVSEILSKEYEERQNRTPLSLVDICFWSTLFYTASIFLHLLRIPTHRHIVGEGCPKPHRLNRHSICACGLFKQEGRPLRWVRKV*TMVAWWPPLLHPPRGVQQWRFSXSLENNC*IGFSKENIDW*YPFWFFIGH*P*KHSTEHQTVQKCTVCICCGWC*SFLLTFNDYSLMSVTQMFKSLPNKIFKEP*LFCGQFDINNLQILSSMHIKEPPNIISTLRSENHGFRQTPYYACDEARSCMFNILQQYPMTIFTLVLKRFYVIGFSVAVDEFL*FRLVNCKLNRVMWRAFNVNPCCVTDHQRLYTSLHLSLHLRLERGQEWGTTVPDGDWCFAW*TLKFHNCIVPCSPFDL*S**NVRPALTRETHLIKYVFIRIPLSWYGTDEHVFPLFDLQEGFYNGCECGLLNLYCFQHVATINQASTGFHSS**TKVVA*YVNLRILLSDYL*RRMLLHQSI*VT*IRVFCCV*QPQICEGLNHII*SFPSLFSHTGNCECWHRSKLVDHK**VFNVPDFDHPSSSSQPRSLNTNKLQNFGPPGELCCREVP*ILATCGCQPLCSLNFLP*FVQVTGEFH*LFL*Q*SYLKEHVLLG*LHDHQ*VNFLIQVPQTNKITVISFRPPHHALHTCNL*KI*AGQKSPRVS*DICLNSSCPVRESSPERLCPSE*WDWLFAHSDEIAQSYPKA*DPLC 101 | Translated Frame:TWO : 
AQRILGVTDLR**QILFHI*IKTQPTDWSAWDNLCPSSVKFHQSYTKLSMLLS*LLASLQY*KGL*MFGRVEFCSFWPSCSWREDPAQS*LVIISNCSM*SSMGHQSHPFYQLHVRSMIPTSY*EAPMKLIGQLN*V*LKPQSWLILKVAAQ*S*KPETSQVVLETTPI*DQWSSH*IGC*QD*IMLLILTRKFSVILKTVGTFVSR*T*QKSTIVTSFTSKWARSLAYSKIRAWLVVKCLPY*KIPLGRTSAKETMSAPFILSFRVISNRSSVAGNC*TFSAGHCLMPQGLICLVVFVWKNGC*FQVN*NALETQLWQSAT*IMTQSSVTCLGFLISTKRQ*SLFRTKQSIGWTQ*LLLSIH*SLIIFL*RTGLKN**MFLTVITPNFGMSITQV*ICTPFQDVGLLKMVAT*MCLTSGMSGYWRVIILFRRSFQRSMRKGKIVHHSHWLTSVSGVHCFTQHQFSYTS*EFQPTDTLLVRAARSLIG*TGTQYVLVAFSNKKADP*DG*ERCEQWLLGGLHCCTPLGGCSNGGSX*A*RTTVESGSLKRTSIGSTLFGFSLVTDPESTALNIKQSKSAQCAFVVAGADPFFLLLMTIPLCLSHRCSNLFQTRSSKSRDCSAVSLTSTIFKSCLPCISKSLLISSAPCAVKTMDLGRLLIMLVMRPGRACSTSFSNIP*QYLLWSLKDFMSLGFL*QWMNFCDSGW*IANSTGSCGGPSMSIHVVSLTINDSTLLFT*ASTSGLSVDKSGAPPFRMGTGVLLGKLSNSTTVLSHALPLICDLDEM*GQPSPERHTL*SMFS*GFLCPGMALMNMFSLFLISKRVFIMVVNVDSSIFIVSSMLPPSIRQAPAFTAADKLRL*PDMLI*ESS*VITFREGCFSIKASKSLKLGYFAVYSNPRSVRA*TTSFRVSPPCSVIQEIVSAGIDPNWLIISDESLTSQTLTTPPVLANPGL*IPTSCRISDLLVSCVVERFPRYWPPVAVNLSVL*TFCLNLSKSLESSISSSFDNDPILRNMFFWVDFMTINESTSLFKSLKLTRSLSSLLDLLIMRCTLATFEKSKPDRRALASVRTSALTAVVQFESPLLRDCVHLNDGIGCSLIVMKLRRVIQKPRILCA 102 | Translated Frame:THREE : 
HRGS*ALLTCANNRYCFIFR*RPSQLIGQHGTTCVPLQ*NSINHTRSSQCCSRSC*HHCNIERGCECLEEWSFAAFGLLAPGGKILLSHNWSSSRTAACDLQWVINHTLFTSYM*DQ*YLLPTKRPL*S*LGS*IECN*NHSLG*S*RWQLNEAESRKHLRLSWRQPPSEISGLHIELVANRIRSCY*F*PENSL*S*RQWALSSPDELNRKALL*QVSHQNGQGLWRIQRSVHGWW*NVCHTKKYLLVEPVPRKPCQHHSSCPSE*FQTGPQ*QETVELFQLVIV*CHRG*YAWWFLSGKMDVDFK*TEMLWKHSCGKVQLKS*LRVL*HA*AF*FQQKGNSHSSEQNKASAGHSNYCYQFIDL**YSYEEQD*RIDRCSLL*LHQILVCQSHRSKSAHPSKMLAC*KW*LLECV*LQE*VDIGE*SSCFGDPFKGV*GKAKSYTTLTG*HLFLEYIVLHSINFPTPLENSNPQTHCW*GLPEAS*AKQALNMCLWPFQTRRQTLEMGKKGVNNGCLVASIAAPP*GGAAMEVLXEPREQLLNRVL*REHRLVVPFLVFHWSLTLKAQH*TSNSPKVHSVHLLWLVLILSSYF**LFPYVCHTDVQISSKQDLQRAVTVLRSV*HQQSSNPVFHAYQRAS*YHQHLAQ*KPWI*ADSLLCL**GQVVHVQHPSAISHDNIYFGP*KILCHWVFCSSG*IFVIQAGKLQTQQGHVAGLQCQSMLCH*PSTTLHFSSPEPPPQA*AWTRVGHHRSGWGLVFCLVNSQIPQLYCPMLSL*SVILMKCKASPHQRDTPYKVCFHKDSSVLVWH**TCFPSF*SPRGFL*WL*MWTPQSLLFPACCHHQSGKHRLSQQLIN*GCSLIC*FENPPE*LPLEKDASPSKHLSHLN*GILLCIATPDL*GPEPHHLEFPLPVQSYRKL*VLASIQIG*S*VMSL*RPRL*PPLQF*PTQVSEYQQVAEFRTSW*AVL*RGSLDTGHLWLSTSLFFELFALICPSHWRVPLALPLTMILS*GTCSFGLTS*PSMSQLPYSSPSN*QDHCHLF*TSSSCVAHLQPLKNLSRTEEPSRQLGHLP*QQLSSSRVLS*ETVSI*MMGLVVRS***NCAELSKSLGSSV 103 | Translated Frame:REVERSED_ONE : 
RTEDPRLLDNSAQFHHYERTTNPIIQMDTVSQERTLELDNCC*GRCPN*REGSSVRLRFFKGCKCATHDEEV*KR*Q*SC*FEGLE*GS*LIDGHEVNPKEHVP*DRIIVKGRANGTLQ*LGQIKAKSSKNREVDSHRWPVSREPLYNTAHQEVRNSATCWYSETWVG*NWRGGQSLGR*RLITYDQPIWIDASTHNFLYD*TGRGNSK*CGSGPHRSGVAIHSKIP*FK*LRCFDGEASFSKGNHSGGFSN*HIRLQP*FISCCESRCLPD*WWQHAGNNKD*GVHIHNHYKNPLGDQKEGKHVHQCHTRTEESL*KHTL*GVSLW*GLALHFIKITDQRESMGQYSCGI*EFTKQNTSPHPERWCPTLVHAQA*GGGSGEEKCRVVDGQ*HNMD*H*RPAT*PC*VCNLPA*ITKIHPLLQKTQ*HKIF*GPK*ILSWDIAEGC*TCTTWPHHKHNKESA*IHGFHCARC**Y*EAL*YAWKTGFEDC*CQTDRRTVTAL*RSCLEEI*TSV*QT*GNSH*K*EERISTSHNKCTLCTFGLFDVQCCAFRVSDQ*KTKKGTTNRCSL*RTRFNSCSLGSXRTSIAAPP*GGAAMEATKQPLFTPFLPISRVCLLV*KGHKHILSACLAYEASGSPHQQCVCGLEFSRGVGKLMLCKTMYSRNRCQPVRVVYDFAFPHTPLKGSPKQDDHSPISTHS*SQTHSSSYHF*QANILEGCADLDLCD*HTKIWCNYSKEHLSIL*SCSS*EYYQRSMN**Q*LLCPADALFCSEE*LLPFC*NQKA*ACHRTLSHDLSCTLPQLCFQSISVHLKSTSIFPDKNHQAYQPLWHQTMTS*KSSTVSCY*GPV*NYSEGQDEWC*HGFLGTGSTKRYFLVWQTFYHQPCTDL*IRQRPCPF*CETCHNSAFLLSSSGDESAHCL*DHREFSGQNQ*HDLILLATNSM*RPLISDGGCLQDNLRCFRLSASLSCHLQDQPRLWFQLHSIQLPNQLHRGLLVGSRYH*SYM*LVKRV*LMTH*RSHAAVRDDDQL*LSRIFPPGARRPKAAKLHSSKHSQPLSILQ*C*QLREQH*ELRV*LMEFH*RGTQVVPC*PISWLGLYLNMKQYLLLAQVSNA*DPLC 104 | Translated Frame:REVERSED_TWO : 
AQRILGFWITLRNFITMSEQPIPSFRWTQSLRRGLSNWTTAVKADVLTDARALLSGLDFSKVASVQRMMRRSKRDDSDLVSLRDLNKEVDSLMVMKSTQKNMFLKIGSLSKEELMELSSDLDKLRQKVQRTERLTATGGQYLGNLSTTQLTRRSEILQLVGIQRPGLARTGGVVKVWDVKDSSLMINQFGSMPALTISCMTEQGGETLNDVVQALTDLGLLYTAKYPNLSDLDALMEKHPSLKVITQEDSQINISGYNLSLSAAVKAGACLIDGGNMLETIKIEESTFTTIIKTLLEIKKRENMFISAIPGQRNPYENILYKVCLSGEGWPYISSRSQIKGRAWDNTVVEFESLPSKTPVPIRNGGAPLLSTLKPEVEAQVKRSVESLMVSDTTWIDIEGPPHDPVEFAIYQPESQKFIHCYRKPNDIKSFKDQSKYCHGILLKDVEHARPGLITSIIRSLPKSMVFTAQGADDIRRLFDMHGRQDLKIVDVKLTAEQSRLFEDLVWKRFEHLCDRHKGIVIKSKKKGSAPATTNAHCALLDCLMFSAVLSGSVTNEKPKRVLPIDVLFREPDSTVVL*AXREPPLLHPPRGVQQWRPPSNHCSHLSYPSQGSAFLFEKATSTY*VPV*PMRLRAALTNNVSVGWNSQEV*EN*CCVKQCTPETDVNQ*EWCTILPFLILL*KDLRNKMITLQYPLIPEVRHIQVATIFNKPTSWKGVQI*TCVIDIPKFGVITVRNIYQFFNPVLHKNIIRDQ*IDSSNYCVQPMLCFVLKSDYCLFVEIKKPKHVTEL*VMI*VALCHSCVSKAFQFT*NQHPFFQTKTTRHISPCGIRQ*PAEKVQQFPATEDLFEITLKDKMNGADMVSLALVRPRGIF*YGKHFTTSHARIFEYAKDLAHFDVKLVTIVLFC*VHLETKVPTVFKITENFRVRINNMI*SC*QPIQCEDH*SQMGVVSKTT*DVSGFQLH*AATFKINQDCGFSYTQFNCPISFIGAS**EVGIIDLTCNW*KGCD**PIEDHMLQFEMMTNYD*AGSSRQEQEGQKLQNSTLPNIHNPFQYCNDANSYESNIESFVYD*WNFTEEGHKLSHADQSVGWVFI*I*NSICY*RKSVTPRILCA 105 | Translated Frame:REVERSED_THREE : 
HRGS*AFG*LCAISSL*ANNQSHHSDGHSLSGEDSRTGQLLLRQMS*LTRGLFCPA*IFQRLQVCNA**GGLKEMTVILLV*GT*IRKLTH*WS*SQPKRTCSLR*DHCQRKS*WNSPVTWTN*GKKFKEQRG*QPQVASI*GTSLQHSSPGGPKFCNLLVFRDLGWLELEGWSKSGTLKTHHL*STNLDRCQHSQFPV*LNREGKL*MMWFRPSQIWGCYTQQNTLI*VT*ML*WRSILL*R*SLRRILKLTYQATTLVYQLL*KPVLA*LMVATCWKQ*RLRSPHSQPL*KPSWRSKRGKTCSSVPYQDRGILMKTYFIRCVSLVRAGLTFHQDHRSKGEHGTIQLWNLRVYQAKHQSPSGTVVPHSCPRSSLRWRLR*REV*SR*WSVTQHGLTLKARHMTLLSLQFTSLNHKNSSTATENPMT*NLLRTKVNIVMGYC*RMLNMHDLASSQA**GVCLNPWFSLRKVLMILGGSLICMEDRI*RLLMSN*PQNSHGSLKILFGRDLNICVTDIRE*SLKVRRKDQHQPQQMHTVHFWTV*CSVLCFQGQ*PMKNQKGYYQSMFSLENPIQQLFSRLXENLHCCTPLGGCSNGGHQATIVHTFLTHLKGLPSCLKRPQAHIECLFSL*GFGQPSPTMCLWVGILKRCRKIDAV*NNVLQKQMSTSESGVRFCLSSYSFERISETR*SLSNIHSFLKSDTFK*LPFLTSQHLGRVCRFRPV*LTYQNLV*LQ*GTSINSLILFFIRILSEINELIAVITVSSRCFVLF*RVTIAFLLKSKSLSMSQNSES*FKLHFATAVFPKHFSSLEINIHFSRQKPPGISAPVASDNDQLKKFNSFLLLRTCLKLL*RTR*MVLTWFPWHWFDQEVFFSMANILPPAMHGSLNTPKTLPILM*NLSQ*CFSVKFIWRRKCPLSLRSQRIFGSESIT*SNPVSNQFNVKTTDLRWGLSPRQPEMFPAFSFIELPPSRSTKTVVSVTLNSTAQSAS*GPLSRK*VSLILHVTGKKGVIDDPLKITCCSSR**PIMTEQDLPARSKKAKSCKTPLFQTFTTPFNIAMMLTATRATLRASCMIDGISLKRDTSCPMLTNQLAGSLSKYETVSVISASQ*RLGSSV 106 | ``` 107 | <!--automatically generated footer--> 108 | 109 | --- 110 | 111 | Navigation: 112 | [Home](../README.md) 113 | | [Book 1: The Core Module](README.md) 114 | | Chapter 4 : Translating 115 | 116 | Prev: [Chapter 3 : Reading and Writing sequences](readwrite.md) 117 | -------------------------------------------------------------------------------- /genomics/README.md: -------------------------------------------------------------------------------- 1 | The BioJava - Genomics Module 2 | ===================================================== 3 | 4 | A tutorial for the genomics module of [BioJava](http://www.biojava.org) 5 | 6 | ## About 7 | <table> 8 | <tr> 9 | <td> 10 | <img src="img/genomics.png"/> 11 | </td> 12 | <td> 13 | The <i>genome</i> module of BioJava provides an API that allows to 14 | <ul> 15 | <li>Parse popular file formats used in genomcs</li> 16 | 
<li>Convert from one file format to another</li> 17 | <li>Translate DNA sequences into protein sequences</li> 18 | </ul> 19 | </td> 20 | </tr> 21 | </table> 22 | 23 | ## Index 24 | 25 | This tutorial is split into several chapters. 26 | 27 | Chapter 1 - Quick [Installation](installation.md) 28 | 29 | Chapter 2 - Reading [gene names information](genenames.md) from genenames.org 30 | 31 | Chapter 3 - Reading [chromosomal positions](chromosomeposition.md) for genes. (UCSC's refFlat.txt.gz ) 32 | 33 | Chapter 4 - Reading [GTF and GFF files](gff.md) 34 | 35 | Chapter 5 - Reading and writing a [Genebank](genebank.md) file 36 | 37 | Chapter 5 - Reading [karyotype (cytoband)](karyotype.md) files 38 | 39 | Chapter 6 - Reading genomic DNA sequences using UCSC's [.2bit file format](twobit.md) 40 | 41 | 42 | ## License 43 | 44 | The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). 45 | 46 | ## Please Cite 47 | 48 | **BioJava 5: A community driven open-source bioinformatics library**<br/> 49 | *Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. 
Duarte* <br/> 50 | [PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791) <br/> 51 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) 52 | 53 | 54 | 55 | <!--automatically generated footer--> 56 | 57 | --- 58 | 59 | Navigation: 60 | [Home](../README.md) 61 | | Book 4: The Genomics Module 62 | 63 | Prev: [Book 3: The Structure Modules](../structure/README.md) 64 | -------------------------------------------------------------------------------- /genomics/chromosomeposition.md: -------------------------------------------------------------------------------- 1 | Parse Chromosomal Information of Genes 2 | ====================================== 3 | 4 | BioJava contains a parser for the [refFlat.txt.gz](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz) 5 | file from the UCSC genome browser, which contains a mapping of gene names to chromosome positions. 
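
To make the mapping from gene names to chromosome positions more concrete, here is a minimal sketch of how a single refFlat line could be split into fields. It assumes the 11-column, tab-separated refFlat layout (geneName, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds). The class `RefFlatLineSketch`, its `Entry` holder, and the sample line are hypothetical illustrations and are not part of BioJava; in practice you would rely on BioJava's own parser.

```java
import java.util.ArrayList;
import java.util.List;

public class RefFlatLineSketch {

    /** Hypothetical holder for a few refFlat columns (not a BioJava class). */
    public static final class Entry {
        public final String geneName;
        public final String transcriptName;
        public final String chromosome;
        public final char strand;
        public final int txStart;
        public final int txEnd;
        public final List<Integer> exonStarts;

        Entry(String geneName, String transcriptName, String chromosome, char strand,
              int txStart, int txEnd, List<Integer> exonStarts) {
            this.geneName = geneName;
            this.transcriptName = transcriptName;
            this.chromosome = chromosome;
            this.strand = strand;
            this.txStart = txStart;
            this.txEnd = txEnd;
            this.exonStarts = exonStarts;
        }
    }

    /** Splits one tab-separated refFlat line into an Entry. */
    public static Entry parse(String line) {
        String[] cols = line.split("\t");
        if (cols.length < 11) {
            throw new IllegalArgumentException(
                    "expected 11 tab-separated columns, got " + cols.length);
        }
        // exonStarts is a comma-separated list that ends with a trailing comma
        List<Integer> exonStarts = new ArrayList<Integer>();
        for (String s : cols[9].split(",")) {
            if (!s.isEmpty()) {
                exonStarts.add(Integer.parseInt(s));
            }
        }
        return new Entry(cols[0], cols[1], cols[2], cols[3].charAt(0),
                Integer.parseInt(cols[4]), Integer.parseInt(cols[5]), exonStarts);
    }

    public static void main(String[] args) {
        // Made-up line for illustration only; not real annotation data
        String line = "GENE1\tNM_000000\tchr1\t+\t100\t500\t150\t450\t2\t100,300,\t200,500,";
        Entry e = parse(line);
        System.out.println(e.geneName + " " + e.chromosome + ":" + e.txStart + "-" + e.txEnd
                + " (" + e.strand + "), exons: " + e.exonStarts.size());
    }
}
```

The sketch keeps only a handful of columns; a real application would also want the coding-region and exon-end columns, which BioJava's parser exposes through its own position objects.
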
6 | 7 | 8 | ```java 9 | try { 10 | 11 | List<GeneChromosomePosition> genePositions= GeneChromosomePositionParser.getChromosomeMappings(); 12 | System.out.println("got " + genePositions.size() + " gene positions") ; 13 | 14 | for (GeneChromosomePosition pos : genePositions){ 15 | if ( pos.getGeneName().equals("FOLH1")) { 16 | System.out.println(pos); 17 | break; 18 | } 19 | } 20 | 21 | } catch(Exception e){ 22 | e.printStackTrace(); 23 | } 24 | ``` 25 | 26 | If a local copy of the file is available, it can be provided as an input stream: 27 | 28 | 29 | ```java 30 | 31 | URL url = new URL("file:///local/copy/of/file"); 32 | 33 | InputStreamProvider prov = new InputStreamProvider(); 34 | 35 | InputStream inStream = prov.getInputStream(url); 36 | 37 | GeneChromosomePositionParser.getChromosomeMappings(inStream); 38 | 39 | 40 | 41 | ``` 42 | <!--automatically generated footer--> 43 | 44 | --- 45 | 46 | Navigation: 47 | [Home](../README.md) 48 | | [Book 4: The Genomics Module](README.md) 49 | | Chapter 3 : chromosomal positions 50 | 51 | Prev: [Chapter 2 : gene names information](genenames.md) 52 | 53 | Next: [Chapter 4 : GTF and GFF files](gff.md) 54 | -------------------------------------------------------------------------------- /genomics/genebank.md: -------------------------------------------------------------------------------- 1 | Reading and writing a Genbank file 2 | ================================== 3 | 4 | There are multiple ways to read a Genbank file. 
5 | 6 | ## Method 1: Read a Genbank file using the GenbankProxySequenceReader 7 | 8 | ```java 9 | GenbankProxySequenceReader<AminoAcidCompound> genbankProteinReader = 10 | new GenbankProxySequenceReader<AminoAcidCompound>("/tmp", "NP_000257", 11 | AminoAcidCompoundSet.getAminoAcidCompoundSet()); 12 | ProteinSequence proteinSequence = new ProteinSequence(genbankProteinReader); 13 | genbankProteinReader.getHeaderParser().parseHeader( 14 | genbankProteinReader.getHeader(), proteinSequence); 15 | System.out.format("Sequence(%s,%d)=%s...", 16 | proteinSequence.getAccession(), 17 | proteinSequence.getLength(), 18 | proteinSequence.getSequenceAsString().substring(0, 10)); 19 | 20 | GenbankProxySequenceReader<NucleotideCompound> genbankDNAReader = 21 | new GenbankProxySequenceReader<NucleotideCompound>("/tmp", "NM_001126", 22 | DNACompoundSet.getDNACompoundSet()); 23 | DNASequence dnaSequence = new DNASequence(genbankDNAReader); 24 | genbankDNAReader.getHeaderParser().parseHeader(genbankDNAReader.getHeader(), dnaSequence); 25 | System.out.format("Sequence(%s,%d)=%s...", dnaSequence.getAccession(), 26 | dnaSequence.getLength(), 27 | dnaSequence.getSequenceAsString().substring(0, 10)); 28 | ``` 29 | 30 | 31 | ## Method 2: Read a Genbank file using GenbankReaderHelper 32 | 33 | ```java 34 | File dnaFile = new File("src/test/resources/NM_000266.gb"); 35 | File protFile = new File("src/test/resources/BondFeature.gb"); 36 | 37 | LinkedHashMap<String, DNASequence> dnaSequences = 38 | GenbankReaderHelper.readGenbankDNASequence( dnaFile ); 39 | for (DNASequence sequence : dnaSequences.values()) { 40 | System.out.println( sequence.getSequenceAsString() ); 41 | } 42 | 43 | LinkedHashMap<String, ProteinSequence> protSequences = 44 | GenbankReaderHelper.readGenbankProteinSequence(protFile); 45 | for (ProteinSequence sequence : protSequences.values()) { 46 | System.out.println( sequence.getSequenceAsString() ); } 47 | ``` 48 | 49 | ## Method 3: Read a Genbank file using the GenbankReader 
Object 50 | 51 | ```java 52 | FileInputStream is = new FileInputStream(dnaFile); 53 | GenbankReader<DNASequence, NucleotideCompound> dnaReader = 54 | new GenbankReader<DNASequence, NucleotideCompound>( 55 | is, 56 | new GenericGenbankHeaderParser<DNASequence,NucleotideCompound>(), 57 | new DNASequenceCreator(DNACompoundSet.getDNACompoundSet()) 58 | ); 59 | dnaSequences = dnaReader.process(); 60 | is.close(); 61 | System.out.println(dnaSequences); 62 | 63 | is = new FileInputStream(protFile); 64 | GenbankReader<ProteinSequence, AminoAcidCompound> protReader = 65 | new GenbankReader<ProteinSequence, AminoAcidCompound>( 66 | is, 67 | new GenericGenbankHeaderParser<ProteinSequence,AminoAcidCompound>(), 68 | new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet()) 69 | ); 70 | protSequences = protReader.process(); 71 | is.close(); 72 | System.out.println(protSequences); 73 | ``` 74 | 75 | 76 | # Write a Genbank file 77 | 78 | 79 | Use the GenbankWriterHelper to write DNA sequences into a Genbank file. 80 | 81 | ```java 82 | // First let's read some DNA sequences from a genbank file 83 | 84 | File dnaFile = new File("src/test/resources/NM_000266.gb"); 85 | LinkedHashMap<String, DNASequence> dnaSequences = 86 | GenbankReaderHelper.readGenbankDNASequence( dnaFile ); 87 | ByteArrayOutputStream fragwriter = new ByteArrayOutputStream(); 88 | ArrayList<DNASequence> seqs = new ArrayList<DNASequence>(); 89 | for(DNASequence seq : dnaSequences.values()) { 90 | seqs.add(seq); 91 | } 92 | 93 | // ok now we got some DNA sequence data. 
Next step is to write it 94 | 95 | GenbankWriterHelper.writeNucleotideSequence(fragwriter, seqs, 96 | GenbankWriterHelper.LINEAR_DNA); 97 | 98 | // the fragwriter object now contains a string representation in the Genbank format 99 | // and you could write this into a file 100 | // or print it out on the console 101 | System.out.println(fragwriter.toString()); 102 | ``` 103 | 104 | <!--automatically generated footer--> 105 | 106 | --- 107 | 108 | Navigation: 109 | [Home](../README.md) 110 | | [Book 4: The Genomics Module](README.md) 111 | | Chapter 5 : Genebank 112 | 113 | Prev: [Chapter 4 : GTF and GFF files](gff.md) 114 | 115 | Next: [Chapter 5 : karyotype (cytoband)](karyotype.md) 116 | -------------------------------------------------------------------------------- /genomics/genenames.md: -------------------------------------------------------------------------------- 1 | Parse Gene Name Information 2 | =========================== 3 | 4 | The following code parses [a file from the www.genenames.org](http://www.genenames.org/cgi-bin/download?title=HGNC+output+data&hgnc_dbtag=on&col=gd_app_sym&col=gd_app_name&col=gd_status&col=gd_prev_sym&col=gd_prev_name&col=gd_aliases&col=gd_pub_chrom_map&col=gd_pub_acc_ids&col=md_mim_id&col=gd_pub_refseq_ids&col=md_ensembl_id&col=md_prot_id&col=gd_hgnc_id&status=Approved&status_opt=2&where=((gd_pub_chrom_map%20not%20like%20%27%patch%%27%20and%20gd_pub_chrom_map%20not%20like%20%27%ALT_REF%%27)%20or%20gd_pub_chrom_map%20IS%20NULL)%20and%20gd_locus_group%20%3d%20%27protein-coding%20gene%27&order_by=gd_app_sym_sort&format=text&limit=&submit=submit&.cgifields=&.cgifields=chr&.cgifields=status&.cgifields=hgnc_dbtag) 5 | website that contains a mapping of human gene names to other databases. 
6 | 7 | 8 | ```java 9 | /** parses a file from the genenames website 10 | * 11 | * @param args 12 | */ 13 | public static void main(String[] args) { 14 | 15 | try { 16 | 17 | List<GeneName> geneNames = GeneNamesParser.getGeneNames(); 18 | 19 | System.out.println("got " + geneNames.size() + " gene names"); 20 | 21 | 22 | for ( GeneName g : geneNames){ 23 | if ( g.getApprovedSymbol().equals("FOLH1")) 24 | System.out.println(g); 25 | } 26 | // and returns a list of beans that contains key-value pairs for each gene name 27 | 28 | } catch (Exception e) { 29 | // TODO Auto-generated catch block 30 | e.printStackTrace(); 31 | } 32 | 33 | } 34 | ``` 35 | 36 | If you have a local copy of this file, then you can just provide an input stream for it: 37 | 38 | ```java 39 | 40 | URL url = new URL("file:///local/copy/of/file"); 41 | 42 | InputStreamProvider prov = new InputStreamProvider(); 43 | 44 | InputStream inStream = prov.getInputStream(url); 45 | 46 | GeneNamesParser.getGeneNames(inStream); 47 | 48 | 49 | ``` 50 | <!--automatically generated footer--> 51 | 52 | --- 53 | 54 | Navigation: 55 | [Home](../README.md) 56 | | [Book 4: The Genomics Module](README.md) 57 | | Chapter 2 : gene names information 58 | 59 | Prev: [Chapter 1 : Installation](installation.md) 60 | 61 | Next: [Chapter 3 : chromosomal positions](chromosomeposition.md) 62 | -------------------------------------------------------------------------------- /genomics/gff.md: -------------------------------------------------------------------------------- 1 | Reading GFF files 2 | ================= 3 | 4 | The biojava3-genome library leverages the sequence relationships in biojava3-core to read (gtf,gff2,gff3) files and 5 | write gff3 files. The file formats for gtf, gff2, gff3 are well defined but what gets written in the file is very 6 | flexible. We currently provide support for reading gff files generated by open source gene prediction applications 7 | GeneID, GeneMark and GlimmerHMM. 
Each prediction algorithm uses a different ontology to describe coding sequences, 8 | exons, and start or stop codons, which makes it difficult to write a general-purpose gff parser that can create biologically 9 | meaningful objects. If the application is simply loading a gff file and drawing a colored glyph then you don't need to 10 | worry about the ontology used. It is easier to support the popular gene prediction algorithms by writing a parser that 11 | is aware of each gene prediction application's ontology. 12 | 13 | 14 | The following code example takes a 454scaffold file that was used by GeneMark to predict genes and returns a 15 | collection of ChromosomeSequences. Each chromosome sequence maps to a named entry in the fasta file and would 16 | contain N gene sequences. The gene sequences can be on the + or - strand, with frame shifts and multiple transcripts. 17 | 18 | Passing the collection of ChromosomeSequences to GeneFeatureHelper.getProteinSequences would return all protein 19 | sequences. You can then write the protein sequences to a fasta file. 
20 | 21 | ```java 22 | 23 | LinkedHashMap<String, ChromosomeSequence> chromosomeSequenceList = GeneFeatureHelper.loadFastaAddGeneFeaturesFromGeneMarkGTF(new File("454Scaffolds.fna"), new File("genemark_hmm.gtf")); 24 | LinkedHashMap<String, ProteinSequence> proteinSequenceList = GeneFeatureHelper.getProteinSequences(chromosomeSequenceList.values()); 25 | FastaWriterHelper.writeProteinSequence(new File("genemark_proteins.faa"), proteinSequenceList.values()); 26 | ``` 27 | 28 | You can also output the gene sequences to a fasta file in which the coding regions are upper case and the non-coding regions are lower case: 29 | 30 | ```java 31 | LinkedHashMap<String, GeneSequence> geneSequenceHashMap = GeneFeatureHelper.getGeneSequences(chromosomeSequenceList.values()); 32 | Collection<GeneSequence> geneSequences = geneSequenceHashMap.values(); 33 | FastaWriterHelper.writeGeneSequence(new File("genemark_genes.fna"), geneSequences, true); 34 | 35 | ``` 36 | 37 | You can easily write out a gff3 view of a ChromosomeSequence with the following code. 38 | 39 | ```java 40 | FileOutputStream fo = new FileOutputStream("genemark.gff3"); 41 | GFF3Writer gff3Writer = new GFF3Writer(); 42 | gff3Writer.write(fo, chromosomeSequenceList); 43 | fo.close(); 44 | ``` 45 | 46 | The chromosome sequence becomes the middle layer that represents the essence of what is mapped in a gtf, gff2 or 47 | gff3 file. This makes it fairly easy to write code to convert from gtf to gff3 or from gff2 to gtf. The challenge 48 | is picking the correct ontology for writing into gtf or gff2 formats. You could use feature names used by a 49 | specific gene prediction program or features supported by your favorite genome browser. We would like to provide a 50 | complete set of Java classes to do these conversions, where the list of supported gene prediction applications and 51 | genome browsers will get longer based on end user requests. 
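
To illustrate why the column layout is well defined while the vocabulary inside it is not, here is a minimal sketch that splits one gff3 line into its nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes) and reads the `tag=value` attributes into a map. It assumes plain gff3 attribute syntax; gtf and gff2 write the last column differently. The class `Gff3LineSketch`, its `Feature` holder, and the sample line are hypothetical and are not BioJava's parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Gff3LineSketch {

    /** Hypothetical holder for the gff3 columns used below (not a BioJava class). */
    public static final class Feature {
        public final String seqId;
        public final String source;
        public final String type;
        public final int start;
        public final int end;
        public final char strand;
        public final Map<String, String> attributes;

        Feature(String seqId, String source, String type, int start, int end,
                char strand, Map<String, String> attributes) {
            this.seqId = seqId;
            this.source = source;
            this.type = type;
            this.start = start;
            this.end = end;
            this.strand = strand;
            this.attributes = attributes;
        }
    }

    /** Parses a single gff3 feature line; comment and directive lines are not handled. */
    public static Feature parse(String line) {
        String[] cols = line.split("\t", -1);
        if (cols.length != 9 || cols[6].isEmpty()) {
            throw new IllegalArgumentException("expected 9 non-empty gff3 columns");
        }
        // The attributes column holds semicolon-separated tag=value pairs
        Map<String, String> attributes = new LinkedHashMap<String, String>();
        for (String pair : cols[8].split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                attributes.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return new Feature(cols[0], cols[1], cols[2],
                Integer.parseInt(cols[3]), Integer.parseInt(cols[4]),
                cols[6].charAt(0), attributes);
    }

    public static void main(String[] args) {
        // Made-up line for illustration only
        String line = "scaffold1\tGeneMark.hmm\tCDS\t100\t450\t.\t+\t0\tID=cds1;Parent=gene1";
        Feature f = parse(line);
        System.out.println(f.type + " on " + f.seqId + " " + f.start + ".." + f.end
                + " parent=" + f.attributes.get("Parent"));
    }
}
```

The `source` and `type` columns are where each gene prediction program's own vocabulary shows up, which is the part a conversion between formats has to map.
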
52 | 53 | 54 | <!--automatically generated footer--> 55 | 56 | --- 57 | 58 | Navigation: 59 | [Home](../README.md) 60 | | [Book 4: The Genomics Module](README.md) 61 | | Chapter 4 : GTF and GFF files 62 | 63 | Prev: [Chapter 3 : chromosomal positions](chromosomeposition.md) 64 | 65 | Next: [Chapter 5 : Genebank](genebank.md) 66 | -------------------------------------------------------------------------------- /genomics/img/genomics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/genomics/img/genomics.png -------------------------------------------------------------------------------- /genomics/installation.md: -------------------------------------------------------------------------------- 1 | ## Quick Installation 2 | 3 | In the beginning, just one quick paragraph of how to get access to BioJava. 4 | 5 | BioJava is open source and you can get the code from [Github](https://github.com/biojava/biojava), however it might be easier this way: 6 | 7 | BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html) guide. 8 | 9 | Currently, we are providing a BioJava specific Maven repository at (http://biojava.org/download/maven/) . 10 | 11 | You can add the BioJava repository by adding the following XML to your project pom.xml file: 12 | 13 | ```xml 14 | <repositories> 15 | ... 16 | <repository> 17 | <id>biojava-maven-repo</id> 18 | <name>BioJava repository</name> 19 | <url>http://www.biojava.org/download/maven/</url> 20 | </repository> 21 | </repositories> 22 | ``` 23 | 24 | We are currently in the process of changing our distribution to Maven Central, which would not even require this configuration step. 25 | 26 | ```xml 27 | <dependencies> 28 | ... 
29 | 30 | <!-- This imports the latest version of BioJava genomics module --> 31 | <dependency> 32 | 33 | <groupId>org.biojava</groupId> 34 | <artifactId>biojava3-genomics</artifactId> 35 | <version>3.0.8</version> 36 | <!-- note: the genomics module depends on the BioJava-core module and will import it automatically --> 37 | </dependency> 38 | 39 | 40 | <!-- other biojava jars as needed --> 41 | 42 | </dependencies> 43 | ``` 44 | 45 | If you run 46 | 47 | <pre> 48 | mvn package 49 | </pre> 50 | 51 | on your project, the BioJava dependencies will be automatically downloaded and installed for you. 52 | 53 | 54 | <!--automatically generated footer--> 55 | 56 | --- 57 | 58 | Navigation: 59 | [Home](../README.md) 60 | | [Book 4: The Genomics Module](README.md) 61 | | Chapter 1 : Installation 62 | 63 | Next: [Chapter 2 : gene names information](genenames.md) 64 | -------------------------------------------------------------------------------- /genomics/karyotype.md: -------------------------------------------------------------------------------- 1 | Parsing a karyotype file from the UCSC genome browser 2 | ===================================================== 3 | 4 | Karyotype information for the human genome can be read from UCSC's [cytoBand.txt.gz](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz) 5 | file. 6 | 7 | ```java 8 | 9 | CytobandParser me = new CytobandParser(); 10 | try { 11 | SortedSet<Cytoband> cytobands = me.getAllCytobands(new URL("http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz")); 12 | SortedSet<StainType> types = new TreeSet<StainType>(); 13 | for (Cytoband c : cytobands){ 14 | System.out.println(c); 15 | if ( ! 
types.contains(c.getType())) 16 | types.add(c.getType()); 17 | } 18 | System.out.println(types); 19 | } catch (Exception e) { 20 | // TODO Auto-generated catch block 21 | e.printStackTrace(); 22 | } 23 | ``` 24 | 25 | If a local copy of the file is available, you can specify it in the following way: 26 | 27 | ```java 28 | 29 | SortedSet<Cytoband> cytobands = me.getAllCytobands(new URL("file:///path/to/local/copy/")); 30 | 31 | ``` 32 | <!--automatically generated footer--> 33 | 34 | --- 35 | 36 | Navigation: 37 | [Home](../README.md) 38 | | [Book 4: The Genomics Module](README.md) 39 | | Chapter 5 : karyotype (cytoband) 40 | 41 | Prev: [Chapter 5 : Genebank](genebank.md) 42 | 43 | Next: [Chapter 6 : .2bit file format](twobit.md) 44 | -------------------------------------------------------------------------------- /genomics/twobit.md: -------------------------------------------------------------------------------- 1 | Reading a .2bit file 2 | ==================== 3 | 4 | UCSC's .2bit files provide a compact representation of the DNA sequences for a genome. The TwoBitParser class provides 5 | access to the content of such a file. 6 | 7 | ```java 8 | File f = new File("/path/to/file.2bit"); 9 | TwoBitParser p = new TwoBitParser(f); 10 | 11 | String[] names = p.getSequenceNames(); 12 | for(int i=0;i<names.length;i++) { 13 | p.setCurrentSequence(names[i]); 14 | p.printFastaSequence(); 15 | p.close(); // closes the current sequence so that the next one can be selected 16 | } 17 | 18 | ``` 19 | <!--automatically generated footer--> 20 | 21 | --- 22 | 23 | Navigation: 24 | [Home](../README.md) 25 | | [Book 4: The Genomics Module](README.md) 26 | | Chapter 6 : .2bit file format 27 | 28 | Prev: [Chapter 5 : karyotype (cytoband)](karyotype.md) 29 | -------------------------------------------------------------------------------- /installation.md: -------------------------------------------------------------------------------- 1 | ## Quick Installation 2 | 3 | In the beginning, just one quick paragraph of how to get access to BioJava. 
4 | 5 | BioJava is open source and you can get the code from [GitHub](https://github.com/biojava/biojava); however, it might be easier this way: 6 | 7 | BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html) guide. 8 | 9 | As of version 4, BioJava is available in Maven Central. This is all you need to add BioJava dependencies to your project: 10 | 11 | ```xml 12 | <dependencies> 13 | ... 14 | 15 | <!-- This imports the latest version of BioJava genomics module --> 16 | <dependency> 17 | 18 | <groupId>org.biojava</groupId> 19 | <artifactId>biojava-genome</artifactId> 20 | <version>4.2.0</version> 21 | <!-- note: the genomics module depends on the BioJava-core module and will import it automatically --> 22 | </dependency> 23 | 24 | 25 | <!-- other biojava jars as needed --> 26 | 27 | 28 | <!-- This imports the latest version of BioJava structure module --> 29 | <dependency> 30 | 31 | <groupId>org.biojava</groupId> 32 | <artifactId>biojava-structure</artifactId> 33 | <version>4.2.0</version> 34 | </dependency> 35 | </dependencies> 36 | ``` 37 | 38 | If you run 39 | 40 | <pre> 41 | mvn package 42 | </pre> 43 | 44 | on your project, the BioJava dependencies will be automatically downloaded and installed for you. 
45 | 46 | -------------------------------------------------------------------------------- /logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/logo.png -------------------------------------------------------------------------------- /modfinder/README.md: -------------------------------------------------------------------------------- 1 | The ModFinder Module of BioJava 2 | ===================================================== 3 | 4 | A tutorial for the modfinder module of [BioJava](http://www.biojava.org) 5 | 6 | ## About 7 | <table> 8 | <tr> 9 | <td> 10 | <img src='https://cloud.githubusercontent.com/assets/840895/22190971/fe5cd304-e0f4-11e6-9eb5-c1b071312081.png'> 11 | </td> 12 | <td> 13 | The <i>modfinder</i> module of BioJava provides an API for identification of protein pre-, co-, and post-translational modifications from structures. 14 | </td> 15 | </tr> 16 | </table> 17 | 18 | ## Index 19 | 20 | This tutorial is split into several chapters. 21 | 22 | Chapter 1 - Quick [Installation](installation.md) 23 | 24 | Chapter 2 - [How to get the list of supported protein modifications](supported-protein-modifications.md) 25 | 26 | Chapter 3 - [How to identify protein modifications in a structure](identify-protein-modifications.md) 27 | 28 | Chapter 4 - [How to define a new protein modification](add-protein-modification.md) 29 | 30 | ## License 31 | 32 | The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). 33 | 34 | ## Please Cite 35 | 36 | **BioJava-ModFinder: identification of protein modifications in 3D structures from the Protein Data Bank**<br/> 37 | *Jianjiong Gao; Andreas Prlic; Chunxiao Bi; Wolfgang F. Bluhm; Dimitris Dimitropoulos; Dong Xu; Philip E. Bourne; Peter W. Rose* <br/> 38 | [Bioinformatics. 
2017 Feb 17.](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx101) <br/> 39 | [![doi](http://img.shields.io/badge/doi-10.1093%2Fbioinformatics%2Fbtx101-blue.svg?style=flat)](https://doi.org/10.1093/bioinformatics/btx101) [![pubmed](http://img.shields.io/badge/pubmed-28334105-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/28334105) 40 | 41 | **BioJava 5: A community driven open-source bioinformatics library**<br/> 42 | *Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte* <br/> 43 | [PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791) <br/> 44 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) 45 | 46 | 47 | 48 | <!--automatically generated footer--> 49 | 50 | --- 51 | 52 | Navigation: 53 | [Home](../README.md) 54 | | Book 6: The ModFinder Module 55 | 56 | Prev: [Book 5: The Protein-Disorder Module](../protein-disorder/README.md) 57 | -------------------------------------------------------------------------------- /modfinder/add-protein-modification.md: -------------------------------------------------------------------------------- 1 | How to define a new protein modification? 2 | === 3 | 4 | The protmod module automatically loads [a list of protein modifications](supported-protein-modifications.md) into the protein modification registry. If you have a protein modification that is not preloaded, you can define it yourself and add it to the registry. 
5 | 6 | ## Example: define and register a disulfide bond in Java code 7 | 8 | ```java 9 | // define the involved components, in this case two cysteines (CYS) 10 | List<Component> components = new ArrayList<Component>(2); 11 | components.add(Component.of("CYS")); 12 | components.add(Component.of("CYS")); 13 | 14 | // define the atom linkages between the components, in this case the SG atoms on both CYS groups 15 | ModificationLinkage linkage = new ModificationLinkage(components, 0, "SG", 1, "SG"); 16 | 17 | // define the modification condition, i.e. what components are involved and what atoms are linked between them 18 | ModificationCondition condition = new ModificationConditionImpl(components, Collections.singletonList(linkage)); 19 | 20 | // build a modification 21 | ProteinModification mod = 22 | new ProteinModificationImpl.Builder("0018_test", 23 | ModificationCategory.CROSS_LINK_2, 24 | ModificationOccurrenceType.NATURAL, 25 | condition) 26 | .setDescription("A protein modification that effectively cross-links two L-cysteine residues to form L-cystine.") 27 | .setFormula("C 6 H 8 N 2 O 2 S 2") 28 | .setResidId("AA0025") 29 | .setResidName("L-cystine") 30 | .setPsimodId("MOD:00034") 31 | .setPsimodName("L-cystine (cross-link)") 32 | .setSystematicName("(R,R)-3,3'-disulfane-1,2-diylbis(2-aminopropanoic acid)") 33 | .addKeyword("disulfide bond") 34 | .addKeyword("redox-active center") 35 | .build(); 36 | 37 | //register the modification 38 | ProteinModificationRegistry.register(mod); 39 | ``` 40 | 41 | ## Example: define a disulfide bond in an XML file and register it from Java code 42 | ```xml 43 | <ProteinModifications> 44 | <Entry> 45 | <Id>0018</Id> 46 | <Description>A protein modification that effectively cross-links two L-cysteine residues to form L-cystine.</Description> 47 | <SystematicName>(R,R)-3,3'-disulfane-1,2-diylbis(2-aminopropanoic acid)</SystematicName> 48 | <CrossReference> 49 | <Source>RESID</Source> 50 | <Id>AA0025</Id> 51 | <Name>L-cystine</Name> 52 | </CrossReference> 53 | 
<CrossReference> 54 | <Source>PSI-MOD</Source> 55 | <Id>MOD:00034</Id> 56 | <Name>L-cystine (cross-link)</Name> 57 | </CrossReference> 58 | <Condition> 59 | <Component component="1"> 60 | <Id source="PDBCC">CYS</Id> 61 | </Component> 62 | <Component component="2"> 63 | <Id source="PDBCC">CYS</Id> 64 | </Component> 65 | <Bond> 66 | <Atom component="1">SG</Atom> 67 | <Atom component="2">SG</Atom> 68 | </Bond> 69 | </Condition> 70 | <Occurrence>natural</Occurrence> 71 | <Category>crosslink2</Category> 72 | <Keyword>redox-active center</Keyword> 73 | <Keyword>disulfide bond</Keyword> 74 | </Entry> 75 | </ProteinModifications> 76 | ``` 77 | 78 | ```java 79 | FileInputStream fis = new FileInputStream("path/to/file"); 80 | ProteinModificationXmlReader.registerProteinModificationFromXml(fis); 81 | ``` 82 | 83 | 84 | Navigation: 85 | [Home](../README.md) 86 | | [Book 6: The ModFinder Modules](README.md) 87 | | Chapter 4 - How to define a new protein modification 88 | 89 | Prev: [Chapter 3 : How to identify protein modifications in a structure](identify-protein-modifications.md) 90 | 91 | -------------------------------------------------------------------------------- /modfinder/identify-protein-modifications.md: -------------------------------------------------------------------------------- 1 | How to identify protein modifications in a structure? 
2 | === 3 | 4 | ## Example: Identify and print all preloaded modifications from a structure 5 | 6 | ```java 7 | Set<ModifiedCompound> identifyAllModifications(Structure struc) { 8 | ProteinModificationIdentifier parser = new ProteinModificationIdentifier(); 9 | parser.identify(struc); 10 | Set<ModifiedCompound> mcs = parser.getIdentifiedModifiedCompound(); 11 | return mcs; 12 | } 13 | ``` 14 | 15 | ## Example: Identify phosphorylation sites in a structure 16 | 17 | ```java 18 | List<ResidueNumber> identifyPhosphosites(Structure struc) { 19 | List<ResidueNumber> phosphosites = new ArrayList<>(); 20 | ProteinModificationIdentifier parser = new ProteinModificationIdentifier(); 21 | parser.identify(struc, ProteinModificationRegistry.getByKeyword("phosphoprotein")); 22 | Set<ModifiedCompound> mcs = parser.getIdentifiedModifiedCompound(); 23 | for (ModifiedCompound mc : mcs) { 24 | Set<StructureGroup> groups = mc.getGroups(true); 25 | for (StructureGroup group : groups) { 26 | phosphosites.add(group.getPDBResidueNumber()); 27 | } 28 | } 29 | return phosphosites; 30 | } 31 | ``` 32 | 33 | ## Demo code to run the above methods 34 | 35 | ```java 36 | import org.biojava.nbio.structure.ResidueNumber; 37 | import org.biojava.nbio.structure.Structure; 38 | import org.biojava.nbio.structure.io.PDBFileReader; 39 | import org.biojava.nbio.protmod.structure.ProteinModificationIdentifier; 40 | 41 | public static void main(String[] args) { 42 | try { 43 | PDBFileReader reader = new PDBFileReader(); 44 | reader.setAutoFetch(true); 45 | 46 | // identify all modifications from PDB:1CAD and print them 47 | String pdbId = "1CAD"; 48 | Structure struc = reader.getStructureById(pdbId); 49 | Set<ModifiedCompound> mcs = identifyAllModifications(struc); 50 | for (ModifiedCompound mc : mcs) { 51 | System.out.println(mc.toString()); 52 | } 53 | 54 | // identify all phosphosites from PDB:3MVJ and print them 55 | pdbId = "3MVJ"; 56 | struc = reader.getStructureById(pdbId); 57 | List<ResidueNumber> psites = 
identifyPhosphosites(struc); 58 | for (ResidueNumber psite : psites) { 59 | System.out.println(psite.toString()); 60 | } 61 | } catch(Exception e) { 62 | e.printStackTrace(); 63 | } 64 | } 65 | ``` 66 | 67 | 68 | Navigation: 69 | [Home](../README.md) 70 | | [Book 6: The ModFinder Modules](README.md) 71 | | Chapter 3 - How to identify protein modifications in a structure 72 | 73 | Prev: [Chapter 2 : How to get a list of supported protein modifications](supported-protein-modifications.md) 74 | 75 | Next: [Chapter 4 : How to define a new protein modification](add-protein-modification.md) 76 | -------------------------------------------------------------------------------- /modfinder/installation.md: -------------------------------------------------------------------------------- 1 | ## Quick Installation 2 | 3 | In the beginning, just one quick paragraph of how to get access to BioJava. 4 | 5 | BioJava is open source and you can get the code from [Github](https://github.com/biojava/biojava), however it might be easier this way: 6 | 7 | BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html) guide. 8 | 9 | As of version 4, BioJava is available in maven central. This is all you would need to add BioJava dependencies to your project in the `pom.xml` file: 10 | 11 | ```xml 12 | <dependencies> 13 | ... 14 | <dependency> 15 | <!-- This imports the latest SNAPSHOT builds from the protein structure modules of BioJava. 16 | --> 17 | <groupId>org.biojava</groupId> 18 | <artifactId>biojava-structure</artifactId> 19 | <version>4.2.0</version> 20 | </dependency> 21 | <dependency> 22 | <!-- This imports the latest SNAPSHOT builds from the protein modfinder modules of BioJava. 
23 | --> 24 | <groupId>org.biojava</groupId> 25 | <artifactId>biojava-modfinder</artifactId> 26 | <version>4.2.0</version> 27 | </dependency> 28 | <!-- other biojava jars as needed --> 29 | </dependencies> 30 | ``` 31 | 32 | If you run 33 | 34 | <pre> 35 | mvn package 36 | </pre> 37 | 38 | on your project, the BioJava dependencies will be automatically downloaded and installed for you. 39 | 40 | 41 | <!--automatically generated footer--> 42 | 43 | --- 44 | 45 | Navigation: 46 | [Home](../README.md) 47 | | [Book 6: The ModFinder Modules](README.md) 48 | | Chapter 1 : Installation 49 | 50 | Next: [Chapter 2 : How to get the list of supported protein modifications](supported-protein-modifications.md) 51 | -------------------------------------------------------------------------------- /modfinder/supported-protein-modifications.md: -------------------------------------------------------------------------------- 1 | How to get a list of supported protein modifications? 2 | === 3 | 4 | The protmod module contains [an XML file](https://github.com/biojava/biojava/blob/master/biojava-modfinder/src/main/resources/org/biojava/nbio/protmod/ptm_list.xml) that defines a list of protein modifications retrieved from the [Protein Data Bank Chemical Component Dictionary](http://www.wwpdb.org/ccd.html), [RESID](http://pir.georgetown.edu/resid/), and [PSI-MOD](http://www.psidev.info/MOD). It contains many common modifications, such as glycosylation, phosphorylation, acetylation, and methylation. Crosslinks are also included, such as disulfide bonds and iso-peptide bonds. 5 | 6 | The protmod module maintains a registry of supported protein modifications. The list of protein modifications contained in the XML file is loaded automatically. You can [define and register a new protein modification](add-protein-modification.md) if it has not been defined in the XML file. 
From the protein modification registry, a user can retrieve: 7 | - all protein modifications, 8 | - a protein modification by ID, 9 | - a set of protein modifications by RESID ID, 10 | - a set of protein modifications by PSI-MOD ID, 11 | - a set of protein modifications by PDBCC ID, 12 | - a set of protein modifications by category (attachment, modified residue, crosslink1, crosslink2, …, crosslink7), 13 | - a set of protein modifications by occurrence type (natural or hypothetical), 14 | - a set of protein modifications by a keyword (glycoprotein, phosphoprotein, sulfoprotein, …), 15 | - a set of protein modifications by involved components. 16 | 17 | ## Examples 18 | 19 | ```java 20 | // a protein modification by ID 21 | ProteinModification mod = ProteinModificationRegistry.getById("0001"); 22 | 23 | Set<ProteinModification> mods; 24 | 25 | // all protein modifications 26 | mods = ProteinModificationRegistry.allModifications(); 27 | 28 | // a set of protein modifications by RESID ID 29 | mods = ProteinModificationRegistry.getByResidId("AA0151"); 30 | 31 | // a set of protein modifications by PSI-MOD ID 32 | mods = ProteinModificationRegistry.getByPsimodId("MOD:00305"); 33 | 34 | // a set of protein modifications by PDBCC ID 35 | mods = ProteinModificationRegistry.getByPdbccId("SEP"); 36 | 37 | // a set of protein modifications by category 38 | mods = ProteinModificationRegistry.getByCategory(ModificationCategory.ATTACHMENT); 39 | 40 | // a set of protein modifications by occurrence type 41 | mods = ProteinModificationRegistry.getByOccurrenceType(ModificationOccurrenceType.NATURAL); 42 | 43 | // a set of protein modifications by a keyword 44 | mods = ProteinModificationRegistry.getByKeyword("phosphoprotein"); 45 | 46 | // a set of protein modifications by involved components. 
47 | mods = ProteinModificationRegistry.getByComponent(Component.of("FAD")); 48 | 49 | ``` 50 | 51 | Navigation: 52 | [Home](../README.md) 53 | | [Book 6: The ModFinder Modules](README.md) 54 | | Chapter 2 - How to get a list of supported protein modifications 55 | 56 | Prev: [Chapter 1 : Installation](installation.md) 57 | 58 | Next: [Chapter 3 : How to identify protein modifications in a structure](identify-protein-modifications.md) 59 | -------------------------------------------------------------------------------- /protein-disorder/README.md: -------------------------------------------------------------------------------- 1 | The Protein-Disorder Module of BioJava 2 | ===================================================== 3 | 4 | A tutorial for the protein-disorder module of [BioJava](http://www.biojava.org) 5 | 6 | ## About 7 | <table> 8 | <tr> 9 | <td> 10 | 11 | </td> 12 | <td> 13 | The <i>protein-disorder module</i> of BioJava provides an API that allows you to 14 | <ul> 15 | <li>predict protein-disorder using the JRONN algorithm</li> 16 | </ul> 17 | 18 | 19 | </td> 20 | </tr> 21 | </table> 22 | 23 | ## How can I predict disordered regions on a protein sequence? 24 | ----------------------------------------------------------- 25 | 26 | BioJava provides a module, *biojava-protein-disorder*, for predicting 27 | disordered regions from a protein sequence. The biojava-protein-disorder 28 | module currently contains one method for the prediction of disordered 29 | regions. This method is based on a Java implementation of the 30 | [RONN](http://www.strubi.ox.ac.uk/RONN) predictor. 31 | 32 | This code was originally developed for use with 33 | [JABAWS](http://www.compbio.dundee.ac.uk/jabaws). We call this code 34 | *JRONN*. *JRONN* is based on the C implementation of the RONN algorithm and 35 | uses the same model data, and therefore gives the same predictions. JRONN 36 | is based on RONN version 3.1, which was still current at the time of writing 37 | (August 2011). 
The main motivation behind JRONN development was to provide an 38 | implementation of RONN more suitable for use in automated analysis 39 | pipelines and web services. Robert Esnouf has kindly allowed us to 40 | explore the RONN code and share the results with the community. 41 | 42 | The original version of RONN is described in [Yang,Z.R., Thomson,R., 43 | McNeil,P. and Esnouf,R.M. (2005) RONN: the bio-basis function neural 44 | network technique applied to the detection of natively disordered 45 | regions in proteins. Bioinformatics 21: 46 | 3369-3376](http://bioinformatics.oxfordjournals.org/content/21/16/3369.full) 47 | 48 | Examples of use are provided below. For more information please refer to 49 | the JronnExample test cases. 50 | 51 | Finally, instead of API calls you can use a [command line 52 | utility](http://biojava.org/wikis/BioJava:CookBook3:ProteinDisorderCLI/ "wikilink"), which is 53 | likely to give you better performance as it uses multiple threads to 54 | perform calculations. 
55 | 56 | Example 1: Calculate the probability of disorder for every residue in the sequence 57 | ---------------------------------------------------------------------------------- 58 | 59 | ```java 60 | FastaSequence fsequence = new FastaSequence("name", 61 | "LLRGRHLMNGTMIMRPWNFLNDHHFPKFFPHLIEQQAIWLADWWRKKHC" + 62 | "RPLPTRAPTMDQWDHFALIQKHWTANLWFLTFPFNDKWGWIWFLKDWTPGSADQAQRACTWFFCHGHDTN"); 63 | 64 | float[] rawProbabilityScores = Jronn.getDisorderScores(fsequence); 65 | ``` 66 | 67 | Example 2: Calculate the probability of disorder for every residue in the sequence for all proteins from the FASTA input file 68 | ----------------------------------------------------------------------------------------------------------------------------- 69 | 70 | ```java 71 | final List<FastaSequence> sequences = SequenceUtil.readFasta(new FileInputStream("src/test/resources/fasta.in")); 72 | Map<FastaSequence, float[]> rawProbabilityScores = Jronn.getDisorderScores(sequences); 73 | ``` 74 | 75 | Example 3: Get the disordered regions of the protein for a single protein sequence 76 | ---------------------------------------------------------------------------------- 77 | 78 | ```java 79 | FastaSequence fsequence = new FastaSequence("Prot1", "LLRGRHLMNGTMIMRPWNFLNDHHFPKFFPHLIEQQAIWLADWWRKKHC" + 80 |                "RPLPTRAPTMDQWDHFALIQKHWTANLWFLTFPFNDKWGWIWFLKDWTPGSADQAQRACTWFFCHGHDTN" + 81 |                "CQIIFEGRNAPERADPMWTGGLNKHIIARGHFFQSNKFHFLERKFCEMAEIERPNFTCRTLDCQKFPWDDP"); 82 | 83 | Range[] ranges = Jronn.getDisorder(fsequence); 84 | ``` 85 | 86 | Example 4: Calculate the disordered regions for the proteins from FASTA file 87 | ---------------------------------------------------------------------------- 88 | 89 | ```java 90 | final List<FastaSequence> sequences = SequenceUtil.readFasta(new FileInputStream("src/test/resources/fasta.in")); 91 | Map<FastaSequence, Range[]> ranges = Jronn.getDisorder(sequences); 92 | 93 | ``` 94 | 95 | ## License 96 | 97 | The content of this 
tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). 98 | 99 | ## Please Cite 100 | 101 | **BioJava 5: A community driven open-source bioinformatics library**<br/> 102 | *Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte* <br/> 103 | [PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791) <br/> 104 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) 105 | 106 | 107 | 108 | <!--automatically generated footer--> 109 | 110 | --- 111 | 112 | Navigation: 113 | [Home](../README.md) 114 | | Book 5: The Protein-Disorder Module 115 | 116 | Prev: [Book 4: The Genomics Module](../genomics/README.md) 117 | | Next: [Book 6: The ModFinder Module](../modfinder/README.md) 118 | -------------------------------------------------------------------------------- /structure/README.md: -------------------------------------------------------------------------------- 1 | The Structure Modules of BioJava 2 | ===================================================== 3 | 4 | A tutorial for the structure modules of [BioJava](http://www.biojava.org) 5 | 6 | ## About 7 | <table> 8 | <tr> 9 | <td> 10 | <img src="img/4hhb_jmol.png"/> 11 | </td> 12 | <td> 13 | The <i>protein structure modules</i> of BioJava provide an API that allows you to 14 | <ul> 15 | <li>Maintain local installations of PDB</li> 16 | <li>Load structures and manipulate them</li> 17 | <li>Perform standard analysis such as sequence and structure alignments</li> 18 | <li>Visualize structures</li> 19 | </ul> 20 | This 
tutorial provides an overview of the most important functionalities. 21 | </td> 22 | </tr> 23 | </table> 24 | 25 | ## Index 26 | 27 | This tutorial is split into several chapters. 28 | 29 | 30 | Chapter 1 - Quick [Installation](installation.md) 31 | 32 | Chapter 2 - [First Steps](firststeps.md) 33 | 34 | Chapter 3 - The [Structure Data Model](structure-data-model.md), for the representation of macromolecular structures 35 | 36 | Chapter 4 - [Local Installations](caching.md) of PDB 37 | 38 | Chapter 5 - The [Chemical Component Dictionary](chemcomp.md) 39 | 40 | Chapter 6 - How to [Work with mmCIF/PDBx Files](mmcif.md) 41 | 42 | Chapter 7 - [SEQRES and ATOM Records](seqres.md), mapping to Uniprot (SIFTs) 43 | 44 | Chapter 8 - [Structure Alignments](alignment.md) 45 | 46 | Chapter 9 - [Biological Assemblies](bioassembly.md) 47 | 48 | Chapter 10 - [External Databases](externaldb.md) like SCOP & CATH 49 | 50 | Chapter 11 - [Accessible Surface Areas](asa.md) 51 | 52 | Chapter 12 - [Contacts Within a Chain and between Chains](contact-map.md) 53 | 54 | Chapter 13 - Finding all Interfaces in Crystal: [Crystal Contacts](crystal-contacts.md) 55 | 56 | Chapter 14 - [Protein Symmetry](symmetry.md) 57 | 58 | Chapter 15 - [Protein Secondary Structure](secstruc.md) 59 | 60 | Chapter 16 - Bonds 61 | 62 | Chapter 17 - [Special Cases](special.md) 63 | 64 | Chapter 18 - [Lists](lists.md) of PDB IDs and PDB [Status Information](lists.md) 65 | 66 | 67 | ## License 68 | 69 | The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](../license.md). 70 | 71 | ## Please Cite 72 | 73 | **BioJava 5: A community driven open-source bioinformatics library**<br/> 74 | *Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. 
Duarte* <br/> 75 | [PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791) <br/> 76 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006791-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006791) [![pubmed](https://img.shields.io/badge/pubmed-30735498-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/30735498) 77 | 78 | 79 | 80 | <!--automatically generated footer--> 81 | 82 | --- 83 | 84 | Navigation: 85 | [Home](../README.md) 86 | | Book 3: The Structure Modules 87 | 88 | Prev: [Book 2: The Alignment Module](../alignment/README.md) 89 | 90 | Next: [Book 4: The Genomics Module](../genomics/README.md) 91 | -------------------------------------------------------------------------------- /structure/alignment-data-model.md: -------------------------------------------------------------------------------- 1 | Structure Alignment Data Model 2 | === 3 | 4 | ## AFPChain Data Model 5 | 6 | The `AFPChain` data structure was designed to store pairwise structural 7 | alignments. The class functions as a bean, and contains many variables 8 | used internally by the alignment algorithms implemented in biojava. 9 | 10 | Some of the important stored variables are: 11 | * Algorithm Name 12 | * Optimal Alignment: described later. 13 | * Optimal RMSD: final and total RMSD value of the alignment. 14 | * TM-score 15 | * BlockRotationMatrix: rotation component of the superposition transformation. 16 | * BlockShiftVector: translation component of the superposition transformation. 
17 | 18 | BioJava class: [org.biojava.nbio.structure.align.model.AFPChain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/model/AFPChain.html) 19 | 20 | ### The Optimal Alignment 21 | 22 | The residue equivalencies of the alignment (EQRs) are described in the optimal 23 | alignment variable, a three-dimensional array of integers, where the indices stand for: 24 | 25 | ```java 26 | int[][][] optAln = afpChain.getOptAln(); 27 | int residue = optAln[block][chain][eqr]; 28 | ``` 29 | 30 | * **block**: the blocks divide the alignment into different parts. The 31 | division can be due to non-topological rearrangements (e.g. circular 32 | permutations) or due to flexible parts (e.g. domain switch). There can 33 | be any number of blocks in a structural alignment, defined by the structure 34 | alignment algorithm. 35 | * **chain**: in a pairwise alignment there are only two chains, or structures. 36 | * **eqr**: EQR stands for equivalent residue position, i.e. the alignment 37 | position. There are as many positions (EQRs) in a block as the length of 38 | the alignment block, and their number is the same for both chains in 39 | the same block. 40 | 41 | Each entry (combination of the three indices described above) stores an 42 | integer that corresponds to the residue index in the specified chain, i.e. 43 | the index in the Atom array of the chain. Within the same block, the stored 44 | integers (residues) are always in increasing order. 
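The indexing scheme described above can be sketched in plain Java. This is a minimal illustration that does not depend on BioJava: the class `OptAlnSketch`, the method `alignedPairs` and the toy data are hypothetical. In real code the array would come from `afpChain.getOptAln()`; the per-block lengths are passed separately here, on the assumption that they are taken from `afpChain.getOptLen()`.

```java
import java.util.ArrayList;
import java.util.List;

public class OptAlnSketch {

    /**
     * Walks an optAln-style array and collects every aligned residue pair
     * as {block, indexInChain1, indexInChain2}.
     * optLen[block] is the number of EQRs that are valid in that block.
     */
    public static List<int[]> alignedPairs(int[][][] optAln, int[] optLen) {
        List<int[]> pairs = new ArrayList<>();
        for (int block = 0; block < optLen.length; block++) {
            for (int eqr = 0; eqr < optLen[block]; eqr++) {
                int res1 = optAln[block][0][eqr]; // index into the Atom array of chain 1
                int res2 = optAln[block][1][eqr]; // index into the Atom array of chain 2
                pairs.add(new int[] {block, res1, res2});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Toy data (not a real alignment): two blocks, as in a circular permutation.
        // Indices increase inside each block, but block 1 maps back to the
        // start of chain 2.
        int[][][] optAln = {
            { {0, 1, 2}, {5, 6, 7} },
            { {10, 11}, {0, 1} }
        };
        int[] optLen = {3, 2};

        for (int[] p : alignedPairs(optAln, optLen)) {
            System.out.println("block " + p[0] + ": " + p[1] + " <-> " + p[2]);
        }
    }
}
```

The toy data shows why blocks are needed: the residue order is increasing within each block, but not across blocks.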
45 | 46 | ### Examples 47 | 48 | Some examples of how to get the basic properties of an `AFPChain`: 49 | 50 | ```java 51 | afpChain.getAlgorithmName(); //Name of the algorithm that generated the alignment 52 | afpChain.getBlockNum(); //Number of blocks 53 | afpChain.getTMScore(); //TM-score 54 | afpChain.getTotalRmsdOpt(); //Optimal RMSD 55 | afpChain.getBlockRotationMatrix()[0]; //get the rotation matrix of the first block 56 | afpChain.getBlockShiftVector()[0]; //get the translation vector of the first block 57 | ``` 58 | 59 | ### Overview 60 | 61 | As an overview, the `AFPChain` data model: 62 | 63 | * Only supports **pairwise alignments**, i.e. two chains or structures aligned. 64 | * Can support **flexible alignments** and **non-topological alignments**. 65 | However, their combination (a flexible alignment with topological rearrangements) 66 | cannot be represented, because the blocks mean either one or the other. 67 | * Cannot support **non-sequential alignments**, or they would require a new block 68 | for each EQR, because sequentiality of the residues is assumed inside each block. 69 | 70 | ## MultipleAlignment Data Model 71 | 72 | Since BioJava 4.1.0, a new data model is available to store structure alignments. 73 | The `MultipleAlignment` data structure is a general model that supports any of the 74 | following properties, and any combination: 75 | 76 | * **Multiple structures**: the model is no longer restricted to pairwise alignments. 77 | * **Non-topological alignments**: such as circular permutations or domain rearrangements. 78 | * **Flexible alignments**: parts of the alignment with different superposition 79 | transformation. 80 | 81 | In addition, the data structure is not limited in the number and types of scores 82 | it can store, because the scores are stored in a key:value fashion, as 83 | described later. 
84 | 85 | BioJava class: [org.biojava.nbio.structure.align.multiple.MultipleAlignment](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/MultipleAlignment.html) 86 | 87 | ### Object Hierarchy 88 | 89 | The biggest difference from `AFPChain` is that the `MultipleAlignment` data 90 | structure is object oriented. 91 | The hierarchy of sub-objects is represented below: 92 | 93 | <pre> 94 | MultipleAlignmentEnsemble 95 | | 96 | MultipleAlignment(s) 97 | | 98 | BlockSet(s) 99 | | 100 | Block(s) 101 | </pre> 102 | 103 | * **MultipleAlignmentEnsemble**: the ensemble is the top level of the hierarchy. 104 | As a top level, it stores information regarding creation properties (algorithm, 105 | version, creation time, etc.), the structures involved in the alignment (Atoms, 106 | structure identifiers, etc.) and cached variables (atomic distance matrices). 107 | It contains a collection of `MultipleAlignment` that share the same properties 108 | stored in the ensemble. This construction allows the storage of alternative 109 | alignments inside the same data structure. 110 | 111 | * **MultipleAlignment**: the `MultipleAlignment` stores the core information of a 112 | multiple structure alignment. It is designed to be the return type of the multiple 113 | structure alignment algorithms. The object contains a collection of `BlockSet` and 114 | it is linked to its parent `MultipleAlignmentEnsemble`. 115 | 116 | * **BlockSet**: the `BlockSet` stores a flexible part of a multiple structure 117 | alignment. A flexible part needs the residue equivalencies involved, contained in 118 | a collection of `Block`, and a transformation matrix for every structure that 119 | describes the 3D superposition of all structures. It is linked to its parent 120 | `MultipleAlignment`. 121 | 122 | * **Block**: the `Block` stores the aligned positions (equivalent residues) of a 123 | `BlockSet` that are in sequentially increasing order. 
Each `Block` represents a 124 | sequential part of a non-topological alignment, if more than one `Block` is present. 125 | It is linked to its parent `BlockSet`. 126 | 127 | ### The Optimal Alignment 128 | 129 | In the `MultipleAlignment` data structure the aligned residues are stored in a 130 | double List for every `Block`. The indices of the double List are the following: 131 | 132 | ```java 133 | List<List<Integer>> optAln = block.getAlnRes(); 134 | Integer residue = optAln.get(chain).get(eqr); 135 | ``` 136 | 137 | The indices mean the same as in the optimal alignment of the `AFPChain`, just to 138 | remember them: 139 | 140 | * **chain**: chain or structure index. 141 | * **eqr**: EQR stands for equivalent residue position, i.e. the alignment 142 | position. There are as many positions (EQRs) in a block as the length of 143 | the alignment block, and their number is equal for any of the chains in 144 | the same block. 145 | 146 | As in `AFPChain`, each entry (combination of the two indices described above) 147 | is an Integer that corresponds to the residue index in the specified chain, i.e. 148 | the index in the Atom array of the chain. Caution has to be taken in the code, 149 | because a `MultipleAlignment` can contain gaps, which are represented as `null` 150 | in the List entries. 151 | 152 | ### Alignment Scores 153 | 154 | All the objects in the hierarchy levels implement the `ScoresCache` interface. 155 | This interface allows the storage of any number of scores as a key:value set. 156 | The key is a `String` that describes the score and used to recover it after, 157 | and the value is a double with the calculated score. The interface has only 158 | two methods: putScore and getScore. 
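The key:value idea can be sketched in plain Java as follows. This is an illustration only: `SimpleScores` is a hypothetical stand-in backed by a map, not the BioJava implementation, and returning `null` for a score that was never stored is an assumption of the sketch.

```java
import java.util.Map;
import java.util.TreeMap;

public class ScoresSketch {

    /** Hypothetical stand-in for an object that caches scores; not a BioJava class. */
    public static class SimpleScores {
        private final Map<String, Double> scores = new TreeMap<>();

        /** Stores a score under the given name, replacing any previous value. */
        public void putScore(String property, Double score) {
            scores.put(property, score);
        }

        /** Returns the stored score, or null if none was stored (sketch assumption). */
        public Double getScore(String property) {
            return scores.get(property);
        }
    }

    public static void main(String[] args) {
        // Each level of the hierarchy keeps its own, independent set of scores.
        SimpleScores alignment = new SimpleScores();
        SimpleScores blockSet = new SimpleScores();

        alignment.putScore("myRMSD", 1.234);
        blockSet.putScore("myRMSD", 0.987);

        System.out.println("alignment myRMSD = " + alignment.getScore("myRMSD"));
        System.out.println("blockSet  myRMSD = " + blockSet.getScore("myRMSD"));

        // Check for a missing score before unboxing it into a primitive double.
        Double tmScore = alignment.getScore("TM-score");
        System.out.println(tmScore == null ? "no TM-score stored" : "TM-score = " + tmScore);
    }
}
```

Because the value is a boxed `Double` in this sketch, assigning a missing score directly to a primitive `double` would fail at runtime, so a null check is a reasonable precaution.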
159 | 160 | The following lines of code are an example of how to manipulate scores 161 | on a `MultipleAlignment`: 162 | 163 | ```java 164 | //Put a score into the alignment and get it back 165 | alignment.putScore("myRMSD", 1.234); 166 | double myRMSD = alignment.getScore("myRMSD"); 167 | 168 | BlockSet bs = alignment.getBlockSets().get(0); 169 | //The same can be done for BlockSets 170 | bs.putScore("bsRMSD", 1.234); 171 | double bsRMSD = bs.getScore("bsRMSD"); 172 | ``` 173 | 174 | ### Manipulating Multiple Alignments 175 | 176 | Some classes are designed to contain utility methods for manipulating a `MultipleAlignment` object. 177 | The most important ones are enumerated and briefly described below: 178 | 179 | * [MultipleAlignmentScorer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentScorer.html): contains frequent names for scores and methods to calculate them. 180 | 181 | * [MultipleAlignmentTools](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentTools.html): contains helper methods, for example to calculate the sequence alignment, transform the atom arrays of the structures, or calculate aligned residue distances between all structures. 182 | 183 | * [MultipleAlignmentWriter](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentWriter.html): contains methods to generate different types of String outputs of the alignment, e.g. FASTA, XML, FatCat. 184 | 185 | * [MultipleSuperimposer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleSuperimposer.html): interface for implementations that calculate the structure superpositions of the alignment. Some examples of implementations are the ReferenceSuperimposer (superimposes all the structures to a reference) and the CoreSuperimposer (only uses EQRs present in all structures, without gaps, to superimpose them). 
186 | 187 | * [MultipleAlignmentXMLParser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/xml/MultipleAlignmentXMLParser.html): contains a method to create a `MultipleAlignment` object from an XML file representation. 188 | 189 | ### Overview 190 | 191 | As an overview, the `MultipleAlignment` data model: 192 | 193 | * Supports any number of aligned structures, **multiple structures**. 194 | * Can support **flexible alignments** and **non-topological alignments**, 195 | and any of their combinations (e.g. a flexible alignment with topological 196 | rearrangements). 197 | * Cannot support **non-sequential alignments**, or they would require a new 198 | `Block` for each EQR, because sequentiality of the residues is a requirement 199 | for each `Block`. 200 | * Can store **any score** at any of the four levels of the object hierarchy, making it 201 | easy to adapt to new requirements and algorithms. 202 | 203 | For more examples and information about the `MultipleAlignment` data structure 204 | go to the Demo package on the biojava-structure module or look through the interface 205 | files, where the javadoc explanations can be found. 206 | 207 | ## Conversion between Data Models 208 | 209 | The conversion from an `AFPChain` to a `MultipleAlignment` is possible through the 210 | ensemble constructor. An example of how to do it programmatically is below: 211 | 212 | ```java 213 | AFPChain afpChain; // the pairwise alignment to convert 214 | Atom[] chain1; // atoms of the first structure 215 | Atom[] chain2; // atoms of the second structure 216 | boolean flexible = false; 217 | MultipleAlignmentEnsemble ensemble = new MultipleAlignmentEnsembleImpl(afpChain, chain1, chain2, flexible); 218 | MultipleAlignment converted = ensemble.getMultipleAlignment(0); 219 | ``` 220 | 221 | There is no method to convert from a `MultipleAlignment` to an `AFPChain`, because 222 | the first representation supports any number of structures, while the second 223 | only supports pairwise alignments. 
However, the conversion can be done with some 224 | lines of code if needed (instantiate a new `AFPChain` and copy one by one the 225 | properties that can be represented from the `MultipleAlignment`). 226 | 227 | === 228 | 229 | Go back to [Chapter 8 : Structure Alignments](alignment.md). 230 | -------------------------------------------------------------------------------- /structure/alignment.md: -------------------------------------------------------------------------------- 1 | Structure Alignments 2 | =========================== 3 | 4 | ## What is a Structure Alignment? 5 | 6 | A **structural alignment** attempts to establish equivalences between two or 7 | more polymer structures based on their shape and three-dimensional conformation. 8 | In contrast to simple structural superposition (see below), where at least some 9 | equivalent residues of the two structures are known, structural alignment requires 10 | no a priori knowledge of equivalent positions. 11 | 12 | A **structural alignment** is a valuable tool for the comparison of proteins with 13 | low sequence similarity, where evolutionary relationships between proteins cannot 14 | be easily detected by standard sequence alignment techniques. Therefore, a 15 | **structural alignment** can be used to imply evolutionary relationships between 16 | proteins that share very little common sequence. However, caution should be exercised 17 | when using the results as evidence for shared evolutionary ancestry, because of the 18 | possible confounding effects of convergent evolution by which multiple unrelated amino 19 | acid sequences converge on a common tertiary structure. 20 | 21 | A **structural alignment** of other biological polymers can also be made in BioJava. 22 | For example, nucleic acids can be structurally aligned to find common structural motifs, 23 | independent of sequence similarity. This is specially important for RNAs, because their 24 | 3D structure arrangement is important for their function. 
 25 | 26 | For more info see the Wikipedia article on [structure alignment](http://en.wikipedia.org/wiki/Structural_alignment). 27 | 28 | ## Alignment Algorithms Supported by BioJava 29 | 30 | BioJava comes with a number of algorithms for aligning structures. The following 31 | five options are displayed by default in the graphical user interface (GUI), 32 | although others can be accessed programmatically using the methods in 33 | [StructureAlignmentFactory](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/StructureAlignmentFactory.html). 34 | 35 | 1. Combinatorial Extension (CE) 36 | 2. Combinatorial Extension with Circular Permutation (CE-CP) 37 | 3. FATCAT - rigid 38 | 4. FATCAT - flexible 39 | 5. Smith-Waterman superposition 40 | 41 | **CE** and **FATCAT** both use structural similarity to align the structures, while 42 | **Smith-Waterman** performs a local sequence alignment and then displays the result 43 | in 3D. See below for descriptions of the algorithms. 44 | 45 | Since BioJava version 4.1.0, multiple structures can be compared at the same time in 46 | a **multiple structure alignment**, which can later be visualized in Jmol. 47 | The algorithm is described in detail below. As an overview, it uses any pairwise alignment 48 | algorithm and a **reference** structure to perform an alignment of all the structures. 49 | Then, it runs a **Monte Carlo** optimization to determine the residue equivalencies among 50 | all the structures, identifying conserved **structural motifs**. 51 | 52 | ## Alignment User Interface 53 | 54 | Before going into the details of how to use the algorithms programmatically, let's take 55 | a look at the user interface that comes with the *biojava-structure-gui* module. 
56 | 57 | ### Pairwise Alignment GUI 58 | 59 | Generating an instance of the GUI is just one line of code: 60 | 61 | ```java 62 | AlignmentGui.getInstance(); 63 | ``` 64 | 65 | This code shows the following user interface: 66 | 67 | ![Alignment GUI](img/alignment_gui.png) 68 | 69 | You can manually select structure chains, domains, or custom files to be aligned. 70 | Try to align 2hyn vs. 1zll. This will show the results in a graphical way, in 71 | 3D: 72 | 73 | ![3D Alignment of PDB IDs 2hyn and 1zll](img/2hyn_1zll.png) 74 | 75 | and also a 2D display, that interacts with the 3D display 76 | 77 | ![2D Alignment of PDB IDs 2hyn and 1zll](img/alignmentpanel.png) 78 | 79 | ### Multiple Alignment GUI 80 | 81 | Because of the inherent difference between multiple and pairwise alignments, 82 | a separate GUI is used to trigger multiple structural alignments. Generating 83 | an instance of the GUI is analogous to the pairwise alignment GUI: 84 | 85 | ```java 86 | MultipleAlignmentGUI.getInstance(); 87 | ``` 88 | 89 | This code shows the following user interface: 90 | 91 | ![Multiple Alignment GUI](img/multiple_gui.png) 92 | 93 | The input format is a free text field, where the structure identifiers are 94 | indicated, space separated. A **structure identifier** is a String that 95 | uniquely identifies a structure. It is basically composed of the pdbID, the 96 | chain letters and the ranges of residues of each chain. For the formal description 97 | visit [StructureIdentifier](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIdentifier.html). 98 | 99 | As an example, a multiple structure alignment of 6 globins is shown here. 100 | Their structure identifiers are shown in the previous figure of the GUI. 
101 | The results are shown in a graphical way, as for the pairwise alignments: 102 | 103 | ![3D Globin Multiple Alignment](img/multiple_jmol_globins.png) 104 | 105 | The only difference with the Pairwise Alignment View is the possibility to show 106 | a subset of structures to be visualized, by checking the boxes under the 3D 107 | window and pressing the Show Only button afterwards. 108 | 109 | A **sequence alignment panel** that interacts with the 3D display can also be shown. 110 | 111 | ![3D Globin Multiple Panel](img/multiple_panel_globins.png) 112 | 113 | Explore the coloring options in the *Edit* menu, and through the *View* menu for 114 | alternative representations of the alignment. 115 | 116 | The functionality to perform and visualize these alignments can also be 117 | used from your own code. Let's first have a look at the alignment algorithms. 118 | 119 | ## Pairwise Alignment Algorithms 120 | 121 | ### Combinatorial Extension (CE) 122 | 123 | The Combinatorial Extension (CE) algorithm was originally developed by 124 | [Shindyalov and Bourne in 125 | 1998](http://peds.oxfordjournals.org/content/11/9/739.short) [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/9796821). 126 | It works by identifying segments of the two structures with similar local 127 | structure, and then combining those to try to align the most residues possible 128 | while keeping the overall root-mean-square deviation (RMSD) of the superposition low. 129 | 130 | CE is a rigid-body alignment algorithm, which means that the structures being 131 | compared are kept fixed during superposition. 
In some cases it may be desirable 132 | to break large proteins up into domains prior to aligning them (by manually 133 | inputting a subrange, using the [SCOP or CATH databases](externaldb.md), or by 134 | decomposing the protein automatically using the [Protein Domain 135 | Parser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/domain/LocalProteinDomainParser.html) 136 | algorithm). 137 | 138 | BioJava class: [org.biojava.nbio.structure.align.ce.CeMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeMain.html) 139 | 140 | ### Combinatorial Extension with Circular Permutation (CE-CP) 141 | 142 | CE and FATCAT both assume that aligned residues occur in the same order in both 143 | structures (i.e. they are both *sequence-order dependent* algorithms). In proteins 144 | related by a circular permutation, the N-terminal part of one protein is related 145 | to the C-terminal part of the other, and vice versa. CE-CP allows circularly 146 | permuted proteins to be compared. For more information on circular 147 | permutations, see the 148 | [Wikipedia](http://en.wikipedia.org/wiki/Circular_permutation_in_proteins) or 149 | [Molecule of the Month](https://pdb101.rcsb.org/motm/124) 150 | articles [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22496628). 151 | 152 | 153 | For proteins without a circular permutation, CE-CP results look very similar to 154 | CE results (with perhaps some minor differences and a slightly longer 155 | calculation time). If a circular permutation is found, the two halves of the 156 | proteins will be shown in different colors: 157 | 158 | ![Concanavalin A (yellow & orange) aligned with Pea Lectin (blue and cyan)](img/3cna.A_2pel.A_cecp.png) 159 | 160 | CE-CP was developed by Spencer E. Bliven, Philip E. Bourne, and Andreas Prlić. 
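To make the idea of a circular permutation concrete, here is a toy sketch on plain strings. It is not BioJava code and not a description of how CE-CP works internally; the class and method names are made up for illustration, and real proteins are never exact rotations of one another. The sketch only shows why a sequence-order dependent method cannot match both halves at once: the second string is the first one with its N-terminal and C-terminal parts swapped.

```java
class CircularPermutationSketch {

    /**
     * Returns the offset at which b starts inside a, if b is an exact
     * circular permutation (rotation) of a, or -1 otherwise.
     * Searching in a doubled copy of a is a standard string trick.
     */
    static int permutationOffset(String a, String b) {
        if (a.length() != b.length()) {
            return -1;
        }
        return (a + a).indexOf(b);
    }

    public static void main(String[] args) {
        String original = "ABCDEF";
        String permuted = "DEFABC"; // C-terminal half moved to the front

        // Offset 3: residues 0-2 of 'permuted' match residues 3-5 of 'original'
        System.out.println(permutationOffset(original, permuted));
    }
}
```

An offset of 0 means the two strings are already in the same order, which corresponds to the case where CE-CP and CE give very similar results.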
161 | 162 | BioJava class: [org.biojava.nbio.structure.align.ce.CeCPMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html) 163 | 164 | ### FATCAT - rigid 165 | 166 | This is a Java implementation of the original FATCAT algorithm by [Yuzhen Ye 167 | & Adam Godzik in 168 | 2003](http://bioinformatics.oxfordjournals.org/content/19/suppl_2/ii246.abstract) 169 | [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/14534198). 170 | It performs similarly to CE for most structures. The 'rigid' flavor uses a 171 | rigid-body superposition and only considers alignments with matching sequence 172 | order. 173 | 174 | BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html) 175 | 176 | ### FATCAT - flexible 177 | 178 | FATCAT-flexible introduces 'twists' between different parts of the structures 179 | which are superimposed independently. This is ideal for proteins which undergo 180 | large conformational shifts, where a global superposition cannot capture the 181 | underlying similarity between domains. For instance, the structures of 182 | calmodulin with and without calcium bound can be much better aligned with 183 | FATCAT-flexible than with one of the rigid alignment algorithms. The downside of 184 | this is that it can lead to additional false positives in unrelated structures. 
 185 | 186 | ![(Left) Rigid and (Right) flexible alignments of calmodulin](img/1cfd_1cll_fatcat.png) 187 | 188 | BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html) 189 | 190 | ### Smith-Waterman 191 | 192 | This aligns residues based on Smith and Waterman's 1981 algorithm for local 193 | *sequence* alignment [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/7265238). No structural information is included in the alignment, so 194 | this only works for structures with significant sequence similarity. It uses the 195 | Blosum65 scoring matrix. 196 | 197 | The two structures are superimposed based on this alignment. Be aware that errors 198 | in locating gaps can lead to high RMSD in the resulting superposition due to a 199 | small number of badly aligned residues. However, this method is faster than 200 | the structure-based methods. 201 | 202 | BioJava class: [org.biojava.nbio.structure.align.seq.SmithWaterman3Daligner](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/seq/SmithWaterman3Daligner.html) 203 | 204 | ### Other methods 205 | 206 | The following methods are not presented in the user interface by default: 207 | 208 | * [BioJavaStructureAlignment](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/BioJavaStructureAlignment.html) 209 | A structure-based alignment method capable of returning multiple alternate 210 | alignments. It was written by Andreas Prlić and based on the PSC++ algorithm 211 | provided by Peter Lackner. 212 | * [CeSideChainMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeSideChainMain.html) 213 | A variant of CE using CB-CB distances, which sometimes improves alignments in 214 | proteins with parallel sheets and helices. 
215 | * [OptimalCECPMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/OptimalCECPMain.html) 216 | An alternate (much slower) algorithm for finding circular permutations. 217 | 218 | Additional methods can be added by implementing the 219 | [StructureAlignment](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/StructureAlignment.html) 220 | interface. 221 | 222 | ## Multiple Structure Alignment 223 | 224 | This Java implementation for multiple structure alignments, named MultipleMC, is based on the original CE-MC implementation by [Guda C, Scheeff ED, Bourne PE & Shindyalov IN in 2001](http://psb.stanford.edu/psb-online/proceedings/psb01/abstracts/p275.html) 225 | [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/11262947). 226 | 227 | The idea remains unchanged: perform **all-to-all pairwise alignments** of the structures, choose the 228 | **reference** as the most similar structure to all others and run a **Monte Carlo optimization** of 229 | the multiple residue equivalencies (EQRs) to minimize a score function that depends on the inter-residue 230 | distances. 231 | 232 | However, some details of the implementation have been changed in the BioJava version. 233 | They are described in the main class, as a summary: 234 | 235 | 1. It accepts **any pairwise alignment** algorithm (instead of being attached to CE), so any 236 | of the algorithms described before is suitable for generating a seed for optimization. Note that 237 | this property allows *non-topological* and *flexible* multiple structure alignments, always restricted 238 | by the pairwise alignment algorithm limitations. 239 | 2. The **moves** in the Monte Carlo optimization have been simplified to 3. 240 | 3. A **new move** to insert and delete individual gaps has been added. 241 | 4. The scoring function has been modified to a **continuous** function, maintaining the properties that the authors described. 
242 | 5. The **probability function** is normalized in synchronization with the optimization progression, to improve the convergence into a maximum score after some random exploration of the multidimensional alignment space. 243 | 244 | The algorithm performs similarly to other multiple structure alignment algorithms for most protein families. 245 | The parameters both for the pairwise aligner and the MC optimization can have an impact on the final result. There is not a unique set of parameters, because they usually depend on the specific use case. Thus, trying some parameter combinations, keeping in mind the effect they produce in the score function, is a good practice when doing any structure alignment. 246 | 247 | BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html) 248 | 249 | 250 | ## Creating Alignments Programmatically 251 | 252 | The **pairwise structure alignment** algorithms in BioJava implement the 253 | `StructureAlignment` interface, and are usually accessed through 254 | `StructureAlignmentFactory`. Here's an example of how to create a CE-CP 255 | alignment and print some information about it. 
256 | 257 | ```java 258 | // Fetch CA atoms for the structures to be aligned 259 | String name1 = "3cna.A"; 260 | String name2 = "2pel"; 261 | AtomCache cache = new AtomCache(); 262 | Atom[] ca1 = cache.getAtoms(name1); 263 | Atom[] ca2 = cache.getAtoms(name2); 264 | 265 | // Get StructureAlignment instance 266 | StructureAlignment algorithm = StructureAlignmentFactory.getAlgorithm(CeCPMain.algorithmName); 267 | 268 | // Perform the alignment 269 | AFPChain afpChain = algorithm.align(ca1,ca2); 270 | 271 | // Print text output 272 | System.out.println(afpChain.toCE(ca1,ca2)); 273 | ``` 274 | 275 | To display the alignment using Jmol, use: 276 | 277 | ```java 278 | GuiWrapper.display(afpChain, ca1, ca2); 279 | // Or using the biojava-structure-gui module 280 | StructureAlignmentDisplay.display(afpChain, ca1, ca2); 281 | ``` 282 | 283 | Note that these require that you include the structure-gui package and the Jmol 284 | binary in the classpath at runtime. 285 | 286 | For creating **multiple structure alignments**, the code is a little bit different, because the 287 | returned data structure and the number of input structures are different. 
Here is an 288 | example of how to create and display a multiple alignment: 289 | 290 | ```java 291 | //Specify the structures to align: some ASP-proteinases 292 | List<String> names = Arrays.asList("3app", "4ape", "5pep", "1psn", "4cms", "1bbs.A", "1smr.A"); 293 | 294 | //Load the CA atoms of the structures and create the structure identifiers 295 | AtomCache cache = new AtomCache(); 296 | List<Atom[]> atomArrays = new ArrayList<Atom[]>(); 297 | List<StructureIdentifier> identifiers = new ArrayList<StructureIdentifier>(); 298 | for (String name:names) { 299 | atomArrays.add(cache.getAtoms(name)); 300 | identifiers.add(new SubstructureIdentifier(name)); 301 | } 302 | 303 | //Generate the multiple alignment algorithm with the chosen pairwise algorithm 304 | StructureAlignment pairwise = StructureAlignmentFactory.getAlgorithm(CeMain.algorithmName); 305 | MultipleMcMain multiple = new MultipleMcMain(pairwise); 306 | 307 | //Perform the alignment 308 | MultipleAlignment result = multiple.align(atomArrays); 309 | 310 | // Set the structure identifiers, so that each atom array can be identified in the outputs 311 | result.getEnsemble().setStructureIdentifiers(identifiers); 312 | 313 | //Output the FASTA sequence alignment 314 | System.out.println(MultipleAlignmentWriter.toFASTA(result)); 315 | 316 | //Display the results in a 3D view 317 | MultipleAlignmentJmolDisplay.display(result); 318 | ``` 319 | 320 | ## Command-Line Tools 321 | 322 | Many of the alignment algorithms are available in the form of command line 323 | tools. These can be accessed through the main methods of the StructureAlignment 324 | classes. 325 | 326 | Example: 327 | ```bash 328 | runCE.sh -pdb1 4hhb.A -pdb2 4hhb.B -show3d 329 | ``` 330 | 331 | Using the command line tool it is possible to run pairwise alignments, several 332 | alignments in batch mode, or full database searches. 
Some additional parameters 333 | are available which are not exposed in the GUI, such as outputting results to a 334 | file in various formats. 335 | 336 | ## Alignment Data Model 337 | 338 | For details about the structure alignment data models in BioJava, see [Structure Alignment Data Model](alignment-data-model.md) 339 | 340 | ## Acknowledgements 341 | 342 | Thanks to P. Bourne, Yuzhen Ye and A. Godzik for granting permission to freely use and redistribute their algorithms. 343 | 344 | <!--automatically generated footer--> 345 | 346 | --- 347 | 348 | Navigation: 349 | [Home](../README.md) 350 | | [Book 3: The Structure Modules](README.md) 351 | | Chapter 8 : Structure Alignments 352 | 353 | Prev: [Chapter 7 : SEQRES and ATOM Records](seqres.md) 354 | 355 | Next: [Chapter 9 : Biological Assemblies](bioassembly.md) 356 | -------------------------------------------------------------------------------- /structure/asa.md: -------------------------------------------------------------------------------- 1 | # Calculating Accessible Surface Areas 2 | 3 | BioJava can also do calculation of Accessible Surface Areas (ASA) through an implementation of the rolling ball algorithm of Shrake and Rupley [Shrake 1973]. 
4 | 5 | This code will do the ASA calculation and output the values per residue and the total: 6 | ```java 7 | AtomCache cache = new AtomCache(); 8 | cache.setUseMmCif(true); 9 | 10 | StructureIO.setAtomCache(cache); 11 | 12 | Structure structure = StructureIO.getStructure("1smt"); 13 | 14 | AsaCalculator asaCalc = new AsaCalculator(structure, 15 | AsaCalculator.DEFAULT_PROBE_SIZE, 16 | 1000, 1, false); 17 | 18 | GroupAsa[] groupAsas = asaCalc.getGroupAsas(); 19 | 20 | double tot = 0; 21 | 22 | for (GroupAsa groupAsa: groupAsas) { 23 | System.out.printf("%1s\t%5s\t%3s\t%6.2f\n", 24 | groupAsa.getGroup().getChainId(), 25 | groupAsa.getGroup().getResidueNumber(), 26 | groupAsa.getGroup().getPDBName(), 27 | groupAsa.getAsaU()); 28 | tot+=groupAsa.getAsaU(); 29 | } 30 | 31 | System.out.printf("Total area: %9.2f\n",tot); 32 | 33 | ``` 34 | See [DemoAsa](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoAsa.java) for a fully working demo. 35 | 36 | [Shrake 1973]: http://www.sciencedirect.com/science/article/pii/0022283673900119 37 | 38 | <!--automatically generated footer--> 39 | 40 | --- 41 | 42 | Navigation: 43 | [Home](../README.md) 44 | | [Book 3: The Structure Modules](README.md) 45 | | Chapter 11 : Accessible Surface Areas 46 | 47 | Prev: [Chapter 10 : External Databases](externaldb.md) 48 | 49 | Next: [Chapter 12 : Contacts Within a Chain and between Chains](contact-map.md) 50 | -------------------------------------------------------------------------------- /structure/bioassembly.md: -------------------------------------------------------------------------------- 1 | Asymmetric Unit and Biological Assembly 2 | ======================================= 3 | 4 | For many proteins, the asymmetric unit and the biological assembly are the same. 
However, there are quite a few proteins where they are not identical, and depending on what you are interested in, it might be important that you work with the biological assembly instead of the asymmetric unit. 5 | 6 | ## Asymmetric Unit 7 | 8 | The asymmetric unit is the smallest portion of a crystal structure to which symmetry operations can be applied in order to generate the complete unit cell (the crystal repeating unit). 9 | 10 | A crystal asymmetric unit may contain: 11 | 12 | * one biological assembly 13 | * a portion of a biological assembly 14 | * multiple biological assemblies 15 | 16 | ## Biological Assembly 17 | 18 | The biological assembly (also sometimes referred to as the biological unit) is the macromolecular assembly that has either been shown to be or is believed to be the functional form of the molecule. For example, the functional form of hemoglobin has four chains. 19 | 20 | The [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) and [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) classes in BioJava provide access methods to work with either the asymmetric unit or the biological assembly. 
21 | 22 | Let's load both representations of hemoglobin PDB ID [1HHO](http://www.rcsb.org/pdb/explore.do?structureId=1hho) and visualize it: 23 | 24 | ```java 25 | public static void main(String[] args){ 26 | 27 | try { 28 | Structure asymUnit = StructureIO.getStructure("1hho"); 29 | 30 | showStructure(asymUnit); 31 | 32 | Structure bioAssembly = StructureIO.getBiologicalAssembly("1hho"); 33 | 34 | showStructure(bioAssembly); 35 | 36 | } catch (Exception e){ 37 | e.printStackTrace(); 38 | } 39 | 40 | } 41 | 42 | public static void showStructure(Structure structure){ 43 | 44 | StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol(); 45 | 46 | jmolPanel.setStructure(structure); 47 | 48 | // send some commands to Jmol 49 | jmolPanel.evalString("select * ; color chain;"); 50 | jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; "); 51 | jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;"); 52 | 53 | } 54 | ``` 55 | 56 | <table> 57 | <tr> 58 | <td> 59 | The <b>asymmetric unit</b> of hemoglobin PDB ID <a href="http://www.rcsb.org/pdb/explore.do?structureId=1hho">1HHO</a> 60 | </td> 61 | <td> 62 | The <b>biological assembly</b> of hemoglobin PDB ID <a href="http://www.rcsb.org/pdb/explore.do?structureId=1hho">1HHO</a> 63 | </td> 64 | </tr> 65 | <tr> 66 | <td> 67 | <img src="img/1hho_asym.png"/> 68 | </td> 69 | <td> 70 | <img src="img/1hho_biounit.png"/> 71 | </td> 72 | </tr> 73 | </table> 74 | 75 | As we can see, the two representations are quite different! When investigating protein interfaces, ligand binding and for many other applications, you always want to work with the biological assemblies. 
 76 | 77 | Here is another example, the bacteriophage GA protein capsid PDB ID [1GAV](http://www.rcsb.org/pdb/explore.do?structureId=1gav) 78 | 79 | <table> 80 | <tr> 81 | <td> 82 | The <b>asymmetric unit</b> of bacteriophage GA protein capsid PDB ID <a href="http://www.rcsb.org/pdb/explore.do?structureId=1gav">1GAV</a> 83 | </td> 84 | <td> 85 | The <b>biological assembly</b> of bacteriophage GA protein capsid PDB ID <a href="http://www.rcsb.org/pdb/explore.do?structureId=1gav">1GAV</a> 86 | </td> 87 | </tr> 88 | <tr> 89 | <td> 90 | <img src="img/1gav_asym.png"/> 91 | </td> 92 | <td> 93 | <img src="img/1gav_biounit.png"/> 94 | </td> 95 | </tr> 96 | </table> 97 | 98 | ## Re-creating Biological Assemblies 99 | 100 | Since biological assemblies can be accessed via the StructureIO interface, in principle there is no need to access the lower-level code in BioJava that re-creates biological assemblies. If you are interested in looking at the gory details of this, here are a couple of pointers into the code. In principle there are two ways to obtain a biological assembly: 101 | 102 | 1. The biological assembly needs to be re-built and the atom coordinates of the asymmetric unit need to be rotated according to the instructions in the files. The information required to re-create the biological assemblies is available in both the PDB and mmCIF/PDBx files. In PDB files the relevant transformations are stored in the *REMARK 350* records. For mmCIF/PDBx, the *_pdbx_struct_assembly* and *_pdbx_struct_oper_list* categories store the corresponding rules. 103 | 104 | 2. There is also a pre-computed file available from the PDB that contains an assembled version of a structure. This file can be parsed directly, without having to perform rotation operations on coordinates. 105 | 106 | As of version 5.0 BioJava contains utility classes to re-create biological assemblies for both PDB and mmCIF files. 
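The first approach above boils down to applying a rotation matrix and a translation vector to every atom of the asymmetric unit. The following is a minimal sketch of that arithmetic in plain Java, not BioJava's own builder code; the class name, the method name, and the example operator (a two-fold rotation about z with a shift along x) are made up for illustration.

```java
class SymmetryOperatorSketch {

    /** Applies x' = R * x + t to a single atom coordinate. */
    static double[] apply(double[][] rotation, double[] translation, double[] xyz) {
        double[] out = new double[3];
        for (int i = 0; i < 3; i++) {
            out[i] = translation[i];
            for (int j = 0; j < 3; j++) {
                out[i] += rotation[i][j] * xyz[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical operator: two-fold rotation about the z axis,
        // followed by a shift of 10 Angstrom along x
        double[][] rotation = { {-1, 0, 0}, {0, -1, 0}, {0, 0, 1} };
        double[] translation = {10.0, 0.0, 0.0};

        double[] atom = {1.0, 2.0, 3.0};
        double[] mate = apply(rotation, translation, atom);

        // The symmetry mate of (1, 2, 3) under this operator is (9, -2, 3)
        System.out.println(mate[0] + " " + mate[1] + " " + mate[2]);
    }
}
```

Repeating this for every atom and every operator listed for an assembly yields the full set of coordinates; parsing a pre-computed assembly file skips this step entirely.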
 107 | 108 | Take a look at the method `getBiologicalAssembly()` in [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) to see how the underlying *BiologicalAssemblyBuilder* is called. 109 | 110 | ## Memory consumption 111 | 112 | As an example of memory requirements, consider loading the structure of the PBCV-1 virus capsid (PDB ID [1M4X](http://www.rcsb.org/pdb/explore.do?structureId=1m4x)). It consists of 16 million atoms and has one of the largest, if not the largest, biological assembly that is currently available in the PDB. Needless to say, it is important to change the maximum heap size parameter, otherwise you will not be able to load it. It requires a minimum of 9GB RAM to load (measured on Java 1.7 on OSX). You can change the heap size by providing the following startup parameter (assuming you have 10G or more of RAM available on your system) 113 | <pre> 114 | -Xmx10G 115 | </pre> 116 | 117 | Note: when loading this structure with 9GB of memory, the Java VM spends a significant amount of time in garbage collection (GC). If you provide more RAM than the minimum requirement, then GC is triggered less often and the biological assembly loads faster. 118 | 119 | <table> 120 | <tr> 121 | <td> 122 | <img src="img/1m4x_bio_r_250.jpg"/> 123 | </td> 124 | </tr> 125 | <tr> 126 | <td> 127 | The biological assembly of the PBCV-1 virus capsid. (image source: <a href="http://www.rcsb.org/pdb/explore.do?structureId=1m4x">RCSB</a>) 128 | </td> 129 | </tr> 130 | </table> 131 | 132 | ## Representing symmetry related chains 133 | Chains are identified by chain identifiers which serve to distinguish the different molecular entities present in the asymmetric unit. Once a biological assembly is built, it can be composed of chains from the asymmetric unit and of chains that result from applying a symmetry operator (these chains are also called "symmetry mates"). 
The problem with that is that the symmetry mates will get the same chain identifiers as the untransformed chains. 134 | 135 | In order to solve that issue there are 2 solutions: 136 | 137 | 1. Assign new chain identifiers. In BioJava the new chain identifiers assigned are of the form `<original chain id>_<symmetry operator id>` (the symmetry operator id is numerical and is the one in field `_pdbx_struct_oper_list.id` in the mmCIF file). 138 | 2. Place the symmetry partners into different models. This is the solution taken by the pre-computed biounit files available from the PDB. 139 | 140 | Since version 5.0 BioJava uses approach 1) to store the biounit in a single `Structure` object. Because the chain identifiers are then of more than 1 character, the Structure can only be written out in mmCIF format (PDB format is limited to 1 character chain identifiers). 141 | 142 | In BioJava one can still produce a biounit using approach 2) by passing a boolean parameter to the `getBiologicalAssembly` method: 143 | ```java 144 | Structure struct = StructureIO.getBiologicalAssembly(pdbId, true); 145 | ``` 146 | ## PDB entries with more than 1 biological assemblies 147 | Many PDB entries are assigned more than 1 biological assemblies. This is due to many factors: sometimes the authors disagree with the annotators, sometimes the authors are not sure about which biological assembly is the right one, sometimes there are several equivalent biological assemblies present in the asymmetric unit (but with slightly different conformations) and each of those is annotated as a different biological assembly. 
 148 | 149 | To get all biological assemblies for a given PDB entry one needs to use: 150 | ```java 151 | List<Structure> bioAssemblies = StructureIO.getBiologicalAssemblies(pdbId); 152 | ``` 153 | 154 | ## Further Reading 155 | 156 | The RCSB PDB web site has a great [tutorial on Biological Assemblies](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies). 157 | 158 | <!--automatically generated footer--> 159 | 160 | --- 161 | 162 | Navigation: 163 | [Home](../README.md) 164 | | [Book 3: The Structure Modules](README.md) 165 | | Chapter 9 : Biological Assemblies 166 | 167 | Prev: [Chapter 8 : Structure Alignments](alignment.md) 168 | 169 | Next: [Chapter 10 : External Databases](externaldb.md) 170 | -------------------------------------------------------------------------------- /structure/caching.md: -------------------------------------------------------------------------------- 1 | Local PDB Installations 2 | ======================= 3 | 4 | BioJava can automatically download and install most of the data files that it needs. Those downloads 5 | will happen only once. Future requests for the data file will re-use the local copy. 6 | 7 | The main class that provides this functionality is the [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html). 8 | 9 | It is hidden inside the StructureIO class, which we already encountered earlier. 10 | 11 | ```java 12 | Structure structure = StructureIO.getStructure("4hhb"); 13 | ``` 14 | 15 | is the same as 16 | 17 | ```java 18 | AtomCache cache = new AtomCache(); 19 | cache.getStructure("4hhb"); 20 | ``` 21 | 22 | 23 | ## Where Are the Files Written to? 24 | 25 | By default the AtomCache writes all files into a temporary location (the system temp directory "java.io.tmpdir"). 
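To see where that default points on a given machine, the standard Java system property `java.io.tmpdir` can be printed directly. The sketch below also illustrates how an override such as the `PDB_DIR` setting discussed in this chapter could be resolved; it is plain Java rather than BioJava's actual lookup code, the class and method names are hypothetical, and the precedence shown (system property first, then environment variable, then the temp directory) is an assumption.

```java
class CacheDirSketch {

    /** Returns the first non-empty candidate, falling back to the temp directory. */
    static String resolve(String systemProperty, String environmentVariable, String tmpDir) {
        if (systemProperty != null && !systemProperty.isEmpty()) {
            return systemProperty;
        }
        if (environmentVariable != null && !environmentVariable.isEmpty()) {
            return environmentVariable;
        }
        return tmpDir;
    }

    public static void main(String[] args) {
        String dir = resolve(
                System.getProperty("PDB_DIR"),          // set with -DPDB_DIR=...
                System.getenv("PDB_DIR"),               // set with export PDB_DIR=...
                System.getProperty("java.io.tmpdir"));  // JVM default temp directory
        System.out.println("Files would be cached under: " + dir);
    }
}
```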
26 | 27 | If you already have a local PDB installation, or you want to use a more permanent location to store the files, 28 | you can configure the AtomCache by setting the PDB_DIR system property 29 | 30 | <pre> 31 | -DPDB_DIR=/wherever/you/want/ 32 | </pre> 33 | 34 | BioJava will also check for a `PDB_DIR` environmental variable. If you launch BioJava from the command line, it can be useful to include `export PDB_DIR=/wherever/you/want` in your `.bashrc` file. 35 | 36 | An alternative is to hard-code the path in this way (but setting it as a property is better style) 37 | 38 | ```java 39 | AtomCache cache = new AtomCache(); 40 | 41 | cache.setPath("/path/to/pdb/files/"); 42 | ``` 43 | 44 | ## File Parsing Parameters 45 | 46 | The AtomCache also provides access to configuring various options that are available during the 47 | parsing of files. The [FileParsingParameters](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/FileParsingParameters.html) 48 | class is the main place to influence the level of detail and as a consequence the speed with which files can be loaded. 49 | 50 | This example turns on the use of chemical components when loading a `Structure`. (See also the [next chapter](chemcomp.md)) 51 | 52 | ```java 53 | AtomCache cache = new AtomCache(); 54 | 55 | cache.setPath("/tmp/"); 56 | 57 | FileParsingParameters params = cache.getFileParsingParams(); 58 | 59 | StructureIO.setAtomCache(cache); 60 | 61 | Structure structure = StructureIO.getStructure("4hhb"); 62 | 63 | ``` 64 | 65 | ## Caching of other SCOP, CATH 66 | 67 | The AtomCache not only provides access to PDB, it can also fetch Structure representations of protein domains, as defined by SCOP and CATH, and the algorithms Protein Domain Parser (PDP) and Domain Parser (DP). 68 | 69 | ```java 70 | // uses a SCOP domain definition 71 | Structure domain1 = StructureIO.getStructure("d4hhba_"); 72 | 73 | // Get a specific protein chain, note: chain IDs are case sensitive, PDB IDs are not. 
74 | Structure chain1 = StructureIO.getStructure("4HHB.A"); 75 | 76 | ``` 77 | 78 | There are quite a number of external database IDs that are supported here. See the 79 | <a href="http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html#getStructure(java.lang.String)">AtomCache documentation</a> for more details on the supported options. 80 | 81 | The non-PDB files can be cached at a different location by setting the `PDB_CACHE_DIR` property (with `java -DPDB_CACHE_DIR=...`) or environmental variable. 82 | 83 | <!--automatically generated footer--> 84 | 85 | --- 86 | 87 | Navigation: 88 | [Home](../README.md) 89 | | [Book 3: The Structure Modules](README.md) 90 | | Chapter 4 : Local Installations 91 | 92 | Prev: [Chapter 3 : Structure Data Model](structure-data-model.md) 93 | 94 | Next: [Chapter 5 : Chemical Component Dictionary](chemcomp.md) 95 | -------------------------------------------------------------------------------- /structure/chemcomp.md: -------------------------------------------------------------------------------- 1 | The Chemical Component Dictionary 2 | ================================= 3 | 4 | The [Chemical Component Dictionary](http://www.wwpdb.org/ccd.html) is an external reference file describing all residue and small molecule components found in PDB entries. This dictionary contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands, and solvent molecules. 5 | 6 | ### How Does BioJava Decide what Groups Are Amino Acids? 7 | 8 | BioJava utilizes the Chem. Comp. Dictionary to achieve a chemically correct representation of each group. 
To make it clear how this works, let's take a look at how [Selenomethionine](http://en.wikipedia.org/wiki/Selenomethionine) and water are dealt with: 9 | 10 | ```java 11 | Structure structure = StructureIO.getStructure("1A62"); 12 | 13 | for (Chain chain : structure.getChains()){ 14 | for (Group group : chain.getAtomGroups()){ 15 | if ( group.getPDBName().equals("MSE") || group.getPDBName().equals("HOH")){ 16 | System.out.println(group.getPDBName() + " is a group of type " + group.getType()); 17 | } 18 | } 19 | } 20 | ``` 21 | 22 | This produces the following output: 23 | 24 | <pre> 25 | MSE is a group of type amino 26 | MSE is a group of type amino 27 | MSE is a group of type amino 28 | HOH is a group of type hetatm 29 | HOH is a group of type hetatm 30 | HOH is a group of type hetatm 31 | ... 32 | </pre> 33 | 34 | As you can see, although MSE is flagged as HETATM in the PDB file, BioJava still represents it correctly as an amino acid. The key is that the [definition file for MSE](http://www.rcsb.org/pdb/files/ligand/MSE.cif) flags it as "L-PEPTIDE LINKING", and BioJava uses that flag. 35 | 36 | Note: Selenomethionine is a naturally occurring amino acid containing selenium. It has the ID MSE in the Chemical Component Dictionary. 37 | 38 | 39 | ### How to Access Chemical Component Definitions 40 | 41 | By default BioJava will retrieve the full chemical component definitions provided by the PDB. That way BioJava makes sure that the user gets a correct representation, e.g. it distinguishes ligands from the polypeptide chain and correctly resolves chemically modified residues. 42 | 43 | The behaviour is configurable by setting a property in the `ChemCompGroupFactory` singleton: 44 | 45 | 1. Use a minimal built-in set of **Chemical Component Definitions**. This only covers the most frequent chemical components. It does not guarantee a correct representation, but it is fast and does not require network access.
46 | ```java 47 | ChemCompGroupFactory.setChemCompProvider(new ReducedChemCompProvider()); 48 | ``` 49 | 2. Load all **Chemical Component Definitions** at startup (slow startup, but then no further delays later on, requires more memory) 50 | ```java 51 | ChemCompGroupFactory.setChemCompProvider(new AllChemCompProvider()); 52 | ``` 53 | 3. Fetch missing **Chemical Component Definitions** on the fly (small download and parsing delays every time a new chemical compound is found). Default behaviour since 4.2.0. Note that the chemical component files are cached in the local file system for subsequent uses. 54 | ```java 55 | ChemCompGroupFactory.setChemCompProvider(new DownloadChemCompProvider()); 56 | ``` 57 | 58 | 59 | <!--automatically generated footer--> 60 | 61 | --- 62 | 63 | Navigation: 64 | [Home](../README.md) 65 | | [Book 3: The Structure Modules](README.md) 66 | | Chapter 5 : Chemical Component Dictionary 67 | 68 | Prev: [Chapter 4 : Local Installations](caching.md) 69 | 70 | Next: [Chapter 6 : Work with mmCIF/PDBx Files](mmcif.md) 71 | -------------------------------------------------------------------------------- /structure/contact-map.md: -------------------------------------------------------------------------------- 1 | # Finding contacts between atoms in a protein: contact maps 2 | 3 | Contacts are a useful tool to analyse protein structures. They simplify the 3-Dimensional view of the structures into a 2-Dimensional set of contacts between its atoms or its residues. The representation of the contacts in a matrix is known as the contact map. Many protein structure analysis and prediction efforts are done by using contacts. 
For instance they can be useful for: 4 | 5 | + development of structural alignment algorithms [Holm 1993][] [Caprara 2004][] 6 | + automatic domain identification [Alexandrov 2003][] [Emmert-Streib 2007][] 7 | + structural modelling by extraction of contact-based empirical potentials [Benkert 2008][] 8 | + structure prediction via contact prediction from sequence information [Jones 2012][] 9 | 10 | ## Getting the contact map of a protein chain 11 | 12 | This code snippet will produce the set of contacts between all C alpha atoms for chain A of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT): 13 | 14 | ```java 15 | AtomCache cache = new AtomCache(); 16 | StructureIO.setAtomCache(cache); 17 | 18 | Structure structure = StructureIO.getStructure("1SMT"); 19 | 20 | Chain chain = structure.getChainByPDB("A"); 21 | 22 | // we want contacts between Calpha atoms only 23 | String[] atoms = {" CA "}; 24 | // the distance cutoff we use is 8A 25 | AtomContactSet contacts = StructureTools.getAtomsInContact(chain, atoms, 8.0); 26 | 27 | System.out.println("Total number of CA-CA contacts: "+contacts.size()); 28 | 29 | 30 | ``` 31 | 32 | The algorithm to find the contacts uses spatial hashing without need to calculate a full distance matrix, thus it scales nicely. 33 | 34 | ## Getting the contacts between two protein chains 35 | 36 | One can also find the contacting atoms between two protein chains. 
For instance the following code finds the contacts between the first 2 chains of PDB entry [1SMT](http://www.rcsb.org/pdb/explore.do?structureId=1SMT): 37 | 38 | ```java 39 | AtomCache cache = new AtomCache(); 40 | StructureIO.setAtomCache(cache); 41 | 42 | Structure structure = StructureIO.getStructure("1SMT"); 43 | 44 | AtomContactSet contacts = 45 | StructureTools.getAtomsInContact(structure.getChain(0), structure.getChain(1), 5, false); 46 | 47 | System.out.println("Total number of atom contacts: "+contacts.size()); 48 | 49 | // the list of atom contacts can be reduced to a list of contacts between groups: 50 | GroupContactSet groupContacts = new GroupContactSet(contacts); 51 | ``` 52 | 53 | 54 | See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoContacts.java) for a fully working demo of the examples above. 55 | 56 | 57 | 58 | [Holm 1993]: http://www.biomedcentral.com/pubmed/8377180 59 | [Caprara 2004]: http://www.biomedcentral.com/pubmed/15072687 60 | [Alexandrov 2003]: http://www.biomedcentral.com/pubmed/12584135 61 | [Emmert-Streib 2007]: http://www.biomedcentral.com/pubmed/17608939 62 | [Benkert 2008]: http://www.biomedcentral.com/pubmed/17932912 63 | [Jones 2012]: http://www.ncbi.nlm.nih.gov/pubmed/22101153 64 | 65 | <!--automatically generated footer--> 66 | 67 | --- 68 | 69 | Navigation: 70 | [Home](../README.md) 71 | | [Book 3: The Structure Modules](README.md) 72 | | Chapter 12 : Contacts Within a Chain and between Chains 73 | 74 | Prev: [Chapter 11 : Accessible Surface Areas](asa.md) 75 | 76 | Next: [Chapter 13 - Finding all Interfaces in Crystal: Crystal Contacts](crystal-contacts.md) 77 | -------------------------------------------------------------------------------- /structure/crystal-contacts.md: -------------------------------------------------------------------------------- 1 | # How to find all crystal contacts in a PDB structure 2 | 3 | ## Why crystal contacts? 
4 | 5 | A protein structure is determined by X-ray diffraction from a protein crystal, i.e. an infinite lattice of molecules. Thus the end result of the diffraction experiment is a crystal lattice and not just a single molecule. However the PDB file only contains the coordinates of the Asymmetric Unit (AU), defined as the minimum unit needed to reconstruct the full crystal using symmetry operators. 6 | 7 | Looking at the AU alone is not enough to understand the crystal structure. For instance the biologically relevant assembly (known as the Biological Unit) can occur through a symmetry operator that can be found looking at the crystal contacts. See for instance [1M4N](http://www.rcsb.org/pdb/explore.do?structureId=1M4N): its biological unit is a dimer that happens through a 2-fold operator and is the largest interface found in the crystal. 8 | 9 | Looking at crystal contacts can also be important in order to assess the quality and reliability of the deposited PDB model: an AU can look perfectly fine but then upon reconstruction of the lattice the molecules can be clashing, which indicates that something is wrong in the model. 
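To make the idea of reconstructing symmetry mates more concrete, here is a minimal sketch in plain Java. It does not use the BioJava API. The class name `SymmetryMateSketch`, the toy coordinates and the 2-fold operator are all assumptions chosen for illustration, and the operator is applied directly to Cartesian coordinates (real crystallographic operators act on fractional coordinates of the unit cell). It applies an operator of the form x' = R x + t to the atoms of a toy "asymmetric unit" and then checks whether the original and the symmetry mate come within a distance cutoff of each other.

```java
class SymmetryMateSketch {

    /** Applies the operator x' = R x + t to a single point (hypothetical helper). */
    static double[] apply(double[][] rot, double[] trans, double[] p) {
        double[] out = new double[3];
        for (int i = 0; i < 3; i++) {
            out[i] = rot[i][0] * p[0] + rot[i][1] * p[1] + rot[i][2] * p[2] + trans[i];
        }
        return out;
    }

    /** Euclidean distance between two points. */
    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /** Returns true if any atom of a lies within the cutoff of any atom of b (brute force). */
    static boolean inContact(double[][] a, double[][] b, double cutoff) {
        for (double[] p : a) {
            for (double[] q : b) {
                if (distance(p, q) <= cutoff) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // assumed 2-fold rotation about the z axis: (x, y, z) -> (-x, -y, z)
        double[][] twoFold = {{-1, 0, 0}, {0, -1, 0}, {0, 0, 1}};
        double[] noShift = {0, 0, 0};

        // toy "asymmetric unit" with two atoms (made-up coordinates, in Angstrom)
        double[][] au = {{1.0, 2.0, 3.0}, {4.0, 0.5, 3.0}};

        // generate the symmetry mate by applying the operator to every atom
        double[][] mate = new double[au.length][];
        for (int i = 0; i < au.length; i++) {
            mate[i] = apply(twoFold, noShift, au[i]);
        }

        System.out.println("In contact within 6 A: " + inContact(au, mate, 6.0));
    }
}
```

The brute-force double loop is only meant to show the principle. With a very small cutoff the same check would flag clashing molecules, which is the kind of model problem described above.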
10 | 11 | 12 | ## Getting the set of unique contacts in the crystal lattice 13 | 14 | This code snippet will produce a list of all non-redundant interfaces present in the crystal lattice of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT): 15 | 16 | ```java 17 | AtomCache cache = new AtomCache(); 18 | 19 | StructureIO.setAtomCache(cache); 20 | 21 | Structure structure = StructureIO.getStructure("1SMT"); 22 | 23 | CrystalBuilder cb = new CrystalBuilder(structure); 24 | 25 | // 6 is the distance cutoff to consider 2 atoms in contact 26 | StructureInterfaceList interfaces = cb.getUniqueInterfaces(6); 27 | 28 | System.out.println("The crystal contains "+interfaces.size()+" unique interfaces"); 29 | 30 | // this calculates the buried surface areas of all interfaces and sorts them by area 31 | interfaces.calcAsas(3000, 1, -1); 32 | 33 | // we can get the largest interface in the crystal and look at its area 34 | interfaces.get(1).getTotalArea(); 35 | 36 | ``` 37 | 38 | An interface is defined here as any 2 chains with at least one pair of atoms within the given distance cutoff (6 Å in the example above). 39 | 40 | The algorithm to find all unique interfaces in the crystal works roughly like this: 41 | + It reconstructs the full unit cell by applying the matrix operators of the corresponding space group to the Asymmetric Unit. 42 | + It searches all cells around the original one by applying crystal translations. If any 2 chains in that search are found to be in contact, the new contact is added to the final list. 43 | + The search is performed without repeating redundant symmetry operators, making sure that if a contact is found then it is a unique contact. 44 | 45 | See [DemoCrystalInterfaces](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoCrystalInterfaces.java) for a fully working demo of the example above. 46 | 47 | ## Clustering the interfaces 48 | One can also cluster the interfaces based on their similarity.
The similarity is measured through contact overlap: the number of common contacts divided by the average number of contacts in both chains. The clustering can be done as follows: 49 | 50 | ```java 51 | List<StructureInterfaceCluster> clusters = interfaces.getClusters(); 52 | for (StructureInterfaceCluster cluster:clusters) { 53 | System.out.print("Cluster "+cluster.getId()+" members: "); 54 | for (StructureInterface member:cluster.getMembers()) { 55 | System.out.print(member.getId()+" "); 56 | } 57 | System.out.println(); 58 | } 59 | ``` 60 | 61 | 62 | <!--automatically generated footer--> 63 | 64 | --- 65 | 66 | Navigation: 67 | [Home](../README.md) 68 | | [Book 3: The Structure Modules](README.md) 69 | | Chapter 13 - Finding all Interfaces in Crystal: Crystal Contacts 70 | 71 | Prev: [Chapter 12 : Contacts Within a Chain and between Chains](contact-map.md) 72 | 73 | Next: [Chapter 14 : Protein Symmetry](symmetry.md) 74 | -------------------------------------------------------------------------------- /structure/externaldb.md: -------------------------------------------------------------------------------- 1 | External Databases 2 | ================== 3 | 4 | BioJava provides access to a number of external structural databases. These often use [caching](caching.md) to reduce the amount of data which must be downloaded from the database. 5 | 6 | SCOP 7 | ---- 8 | 9 | <table> 10 | <tr><td> 11 | <img src="img/1dan_scop.png" width=300 /> 12 | </td> 13 | <td> 14 | (Top) The structure <a href="http://www.rcsb.org/pdb/explore.do?structureId=1dan">1dan</a> contains four chains. <br/> 15 | 16 | (Bottom) These chains are broken up into six SCOP domains. The green chain L becomes 3 domains, while a combination of chains U (red) and T (orange) forms the central purple domain. 17 | </td> 18 | </tr> 19 | </table> 20 | 21 | The Structural Classification of Proteins (SCOP) is a manually curated classification of protein structural domains.
It provides two pieces of data: 22 | 23 | * The breakdown of a protein into structural domains 24 | * A classification of domains according to their structure. 25 | 26 | The structure for a known SCOP domain can be fetched via its 7-letter domain ID (eg 'd2bq6a1') via ```StructureIO.getStructure()```, as described in [Local PDB Installations](caching.md#Caching of other SCOP, CATH). 27 | 28 | The SCOP classification can be accessed through the [```ScopDatabase```](http://www.biojava.org/docs/api/org/biojava/nbio/structure/scop/ScopDatabase.html) class. 29 | 30 | ```java 31 | ScopDatabase scop = ScopFactory.getSCOP(); 32 | ``` 33 | 34 | ### Inspecting SCOP domains 35 | 36 | A list of domains can be retrieved for a given protein. 37 | 38 | ```java 39 | List<ScopDomain> domains = scop.getDomainsForPDB("4HHB"); 40 | ``` 41 | 42 | You can get lots of useful information from the [```ScopDomain```](http://www.biojava.org/docs/api/org/biojava/nbio/structure/scop/ScopDomain.html) object. 43 | 44 | ScopDomain domain = domains.get(0); 45 | String scopID = domain.getScopId(); // d4hhba_ 46 | String classification = domain.getClassificationId(); // a.1.1.2 47 | int sunId = domain.getSunId(); // 15251 48 | 49 | ### Viewing the SCOP hierarchy 50 | 51 | The full hierarchy is available as a tree of [```ScopNode```](http://www.biojava.org/docs/api/org/biojava/nbio/structure/scop/ScopNode.html)s, which can be easily traversed using their ```getParentSunid()``` and ```getChildren()``` methods. 52 | 53 | ```java 54 | ScopNode node = scop.getScopNode(sunId); 55 | while (node != null){ 56 | System.out.println(scop.getScopDescriptionBySunid(node.getSunid())); 57 | node = scop.getScopNode(node.getParentSunid()); 58 | } 59 | ``` 60 | 61 | ScopDatabase also provides access to all nodes at a particular level. 62 | 63 | ```java 64 | List<ScopDescription> superfams = scop.getByCategory(ScopCategory.Superfamily); 65 | System.out.println("Total nr. 
of superfamilies:" + superfams.size()); 66 | ``` 67 | 68 | ### Types of ScopDatabase 69 | 70 | Several types of ```ScopDatabase``` are available. These can be instantiated manually when more control is needed. 71 | 72 | * __RemoteScopInstallation__ (default) Fetches data one node at a time from the internet. Useful when performing a small number of operations. 73 | * __ScopInstallation__ Downloads all SCOP data as a batch and caches it for later use. Much faster when performing many operations. 74 | 75 | Several internal BioJava classes use ```ScopFactory.getSCOP()``` when they encounter references to SCOP domains, so it is always a good idea to notify the ```ScopFactory``` when using a custom ```ScopDatabase``` instance. 76 | 77 | ```java 78 | ScopDatabase scop = new ScopInstallation(); 79 | ScopFactory.setScopDatabase(scop); 80 | ``` 81 | Several versions of SCOP are available. 82 | 83 | ```java 84 | // Use Steven Brenner's updated version of SCOP 85 | scop = ScopFactory.getSCOP(ScopFactory.VERSION_1_75C); 86 | // Use an old version globally, perhaps for an older benchmark 87 | ScopFactory.setScopDatabase(ScopFactory.VERSION_1_69); 88 | ``` 89 | 90 | CATH 91 | ---- 92 | 93 | CATH can be accessed in a very similar fashion to SCOP. In parallel to the ScopInstallation class, there is a CathInstallation. The StructureIO class also allows structures to be requested by CATH ID.
94 | 95 | ```java 96 | 97 | private static final String DEFAULT_SCRIPT ="select * ; cartoon on; spacefill off; wireframe off; select ligands; wireframe on; spacefill on;"; 98 | 99 | private static final String[] colors = new String[]{"red","green","blue","yellow"}; 100 | 101 | public static void main(String args[]){ 102 | 103 | UserConfiguration config = new UserConfiguration(); 104 | config.setPdbFilePath("/tmp/"); 105 | 106 | String pdbID = "1DAN"; 107 | 108 | CathDatabase cath = new CathInstallation(config.getPdbFilePath()); 109 | 110 | List<CathDomain> domains = cath.getDomainsForPdb(pdbID); 111 | 112 | try { 113 | 114 | // show the structure in 3D 115 | BiojavaJmol jmol = new BiojavaJmol(); 116 | jmol.setStructure(StructureIO.getStructure(pdbID)); 117 | jmol.evalString(DEFAULT_SCRIPT); 118 | 119 | System.out.println("got " + domains.size() + " domains"); 120 | 121 | // now color the domains on the structure 122 | int colorpos = -1; 123 | 124 | for ( CathDomain domain : domains){ 125 | 126 | colorpos++; 127 | 128 | showDomain(jmol, domain,colorpos); 129 | } 130 | 131 | 132 | } catch (Exception e) { 133 | // TODO Auto-generated catch block 134 | e.printStackTrace(); 135 | } 136 | 137 | } 138 | 139 | 140 | 141 | private static void showDomain(BiojavaJmol jmol, CathDomain domain, int colorpos) { 142 | List<CathSegment> segments = domain.getSegments(); 143 | 144 | StructureName key = new StructureName(domain.getDomainName()); 145 | String chainId = key.getChainId(); 146 | 147 | String color = colors[colorpos]; 148 | 149 | System.out.println(" * domain " + domain.getDomainName() + " has # segments: " + domain.getSegments().size() + " color: " + color); 150 | 151 | for ( CathSegment segment : segments){ 152 | System.out.println(" * " + segment); 153 | String start = segment.getStart(); 154 | 155 | String stop = segment.getStop(); 156 | 157 | String script = "select " + start + "-" + stop+":"+chainId + "; color " + color +";"; 158 | 159 | jmol.evalString(script ); 160 
| } 161 | 162 | } 163 | ``` 164 | 165 | 166 | <table> 167 | <tr> 168 | <td> 169 | This will show the following 170 | </td> 171 | 172 | <td> 173 | and the text: 174 | </td> 175 | </tr> 176 | 177 | <tr> 178 | <td> 179 | <img src="img/cath_1dan.png" width=300 /> 180 | </td> 181 | <td> 182 | <pre> 183 | 184 | got 4 domains 185 | * domain 1danH01 has # segments: 2 color: red 186 | * CathSegment [segmentId=1, start=16, stop=27, length=12, sequenceHeader=null, sequence=null] 187 | * CathSegment [segmentId=2, start=121, stop=232, length=112, sequenceHeader=null, sequence=null] 188 | * domain 1danH02 has # segments: 2 color: green 189 | * CathSegment [segmentId=1, start=28, stop=120, length=93, sequenceHeader=null, sequence=null] 190 | * CathSegment [segmentId=2, start=233, stop=246, length=14, sequenceHeader=null, sequence=null] 191 | * domain 1danU00 has # segments: 1 color: blue 192 | * CathSegment [segmentId=1, start=91, stop=210, length=120, sequenceHeader=null, sequence=null] 193 | * domain 1danT00 has # segments: 1 color: yellow 194 | * CathSegment [segmentId=1, start=6, stop=80, length=75, sequenceHeader=null, sequence=null] 195 | </pre> 196 | </td> 197 | </tr> 198 | </table> 199 | 200 | 201 | 202 | <!--automatically generated footer--> 203 | 204 | --- 205 | 206 | Navigation: 207 | [Home](../README.md) 208 | | [Book 3: The Structure Modules](README.md) 209 | | Chapter 10 : External Databases 210 | 211 | Prev: [Chapter 9 : Biological Assemblies](bioassembly.md) 212 | 213 | Next: [Chapter 11 : Accessible Surface Areas](asa.md) 214 | -------------------------------------------------------------------------------- /structure/firststeps.md: -------------------------------------------------------------------------------- 1 | First Steps 2 | =========== 3 | 4 | ## First Steps 5 | 6 | The simplest way to load a PDB file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class. 
7 | 8 | ```java 9 | public static void main(String[] args) throws Exception { 10 | Structure structure = StructureIO.getStructure("4HHB"); 11 | // and let's print out how many atoms are in this structure 12 | System.out.println(StructureTools.getNrAtoms(structure)); 13 | } 14 | ``` 15 | 16 | BioJava automatically downloads the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copies it into a temporary location. Then the PDB file parser loads the data into a [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html) object, which provides access to the content of the file. (If you call this a second time, BioJava will automatically re-use the local file.) 17 | 18 | <table> 19 | <tr> 20 | <td> 21 | <a href="http://www.rcsb.org/pdb/explore.do?structureId=4hhb"><img src="img/4hhb_bio_r_250.jpg"/></a> 22 | </td> 23 | <td> 24 | The crystal structure of human deoxyhaemoglobin PDB ID <a href="http://www.rcsb.org/pdb/explore.do?structureId=4hhb">4HHB</a> (image source: <a href="http://www.rcsb.org/pdb/explore.do?structureId=4hhb">RCSB</a>) </td> 25 | </tr> 26 | </table> 27 | 28 | This demonstrates two things: 29 | 30 | + BioJava can automatically download and install files locally (more on this in Chapter 4) 31 | + BioJava by default writes those files into a temporary location (the system temp directory, given by the Java system property "java.io.tmpdir"). 32 | 33 | If you already have a local PDB installation, you can configure where BioJava should read the files from by setting the PDB_DIR system property 34 | 35 | <pre> 36 | -DPDB_DIR=/wherever/you/want/ 37 | </pre> 38 | 39 | ## Memory Consumption 40 | 41 | Speaking of startup properties, it is worth mentioning that many PDB entries are large molecules and that the default maximum heap size of the Java VM is not sufficient in many cases. BioJava contains several built-in caches which automatically adjust to the available memory.
As such, the more memory you grant your Java application, the better it can utilize the caches and the better the performance will be. Change the maximum heap space of your Java VM with this startup parameter: 42 | 43 | <pre> 44 | -Xmx1G 45 | </pre> 46 | 47 | ## A Quick 3D View 48 | 49 | If you have the *biojava-structure-gui* module installed, you can quickly visualise a [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html) like this: 50 | 51 | ```java 52 | public static void main(String[] args) throws Exception { 53 | Structure struc = StructureIO.getStructure("4hhb"); 54 | 55 | StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol(); 56 | 57 | jmolPanel.setStructure(struc); 58 | 59 | // send some commands to Jmol 60 | jmolPanel.evalString("select * ; color chain;"); 61 | jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; "); 62 | jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;"); 63 | } 64 | ``` 65 | 66 | This will result in the following view: 67 | 68 | <table> 69 | <tr> 70 | <td> 71 | <img src="img/4hhb_jmol.png"/> 72 | </td> 73 | <td> 74 | The <a href="http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/gui/jmol/StructureAlignmentJmol.html">StructureAlignmentJmol</a> class wraps the <a href="http://jmol.sourceforge.net/">Jmol</a> viewer and acts as a bridge to BioJava, so Structure objects can be sent to Jmol for visualisation. 75 | </td> 76 | </tr> 77 | </table> 78 | 79 | ## Asymmetric Unit and Biological Assembly 80 | 81 | By default many people work with the *asymmetric unit* of a protein. However, for many studies the correct representation to look at is the *biological assembly* of a protein.
You can request it by calling 82 | 83 | ```java 84 | public static void main(String[] args) throws Exception { 85 | Structure structure = StructureIO.getBiologicalAssembly("1GAV"); 86 | // and let's print out how many atoms are in this structure 87 | System.out.println(StructureTools.getNrAtoms(structure)); 88 | } 89 | ``` 90 | 91 | This topic is important, so we dedicated a [whole chapter](bioassembly.md) to it. 92 | 93 | ## I Loaded a Structure Object, What Now? 94 | 95 | BioJava provides a number of algorithms and visualisation tools that you can use to further analyse the structure, or look at it. Here are a couple of suggestions for further reading: 96 | 97 | + [The BioJava Cookbook for protein structures](http://biojava.org/wiki/BioJava:CookBook#Protein_Structure) 98 | + How does BioJava [represent the content](structure-data-model.md) of a PDB/mmCIF file? 99 | + How to calculate a protein structure alignment using BioJava: [tutorial](alignment.md) or [cookbook](http://biojava.org/wiki/BioJava:CookBook:PDB:align) 100 | + [How to work with Groups (AminoAcid, Nucleotide, Hetatom)](http://biojava.org/wiki/BioJava:CookBook:PDB:groups) 101 | 102 | 103 | 104 | 105 | <!--automatically generated footer--> 106 | 107 | --- 108 | 109 | Navigation: 110 | [Home](../README.md) 111 | | [Book 3: The Structure Modules](README.md) 112 | | Chapter 2 : First Steps 113 | 114 | Prev: [Chapter 1 : Installation](installation.md) 115 | 116 | Next: [Chapter 3 : Structure Data Model](structure-data-model.md) 117 | -------------------------------------------------------------------------------- /structure/img/143px-Selenomethionine-from-xtal-3D-balls.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/143px-Selenomethionine-from-xtal-3D-balls.png --------------------------------------------------------------------------------
/structure/img/1cfd_1cll_fatcat.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1cfd_1cll_fatcat.png -------------------------------------------------------------------------------- /structure/img/1cfd_1cll_fatcat.xcf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1cfd_1cll_fatcat.xcf -------------------------------------------------------------------------------- /structure/img/1cfd_1cll_flexible.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1cfd_1cll_flexible.png -------------------------------------------------------------------------------- /structure/img/1cfd_1cll_rigid.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1cfd_1cll_rigid.png -------------------------------------------------------------------------------- /structure/img/1dan_scop.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1dan_scop.png -------------------------------------------------------------------------------- /structure/img/1gav_asym.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1gav_asym.png -------------------------------------------------------------------------------- 
/structure/img/1gav_biounit.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1gav_biounit.png -------------------------------------------------------------------------------- /structure/img/1hho_asym.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1hho_asym.png -------------------------------------------------------------------------------- /structure/img/1hho_biounit.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1hho_biounit.png -------------------------------------------------------------------------------- /structure/img/1m4x_bio_r_250.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/1m4x_bio_r_250.jpg -------------------------------------------------------------------------------- /structure/img/2hyn_1zll.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/2hyn_1zll.png -------------------------------------------------------------------------------- /structure/img/3cna.A_2pel.A_cecp.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/3cna.A_2pel.A_cecp.png -------------------------------------------------------------------------------- /structure/img/4hhb_bio_r_250.jpg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/4hhb_bio_r_250.jpg -------------------------------------------------------------------------------- /structure/img/4hhb_jmol.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/4hhb_jmol.png -------------------------------------------------------------------------------- /structure/img/alignment_gui.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/alignment_gui.png -------------------------------------------------------------------------------- /structure/img/alignmentpanel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/alignmentpanel.png -------------------------------------------------------------------------------- /structure/img/cath_1dan.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/cath_1dan.png -------------------------------------------------------------------------------- /structure/img/database_search.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/database_search.png -------------------------------------------------------------------------------- /structure/img/database_search_results.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/database_search_results.png -------------------------------------------------------------------------------- /structure/img/multiple_gui.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/multiple_gui.png -------------------------------------------------------------------------------- /structure/img/multiple_jmol_globins.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/multiple_jmol_globins.png -------------------------------------------------------------------------------- /structure/img/multiple_panel_globins.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/multiple_panel_globins.png -------------------------------------------------------------------------------- /structure/img/symm_combined.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_combined.png -------------------------------------------------------------------------------- /structure/img/symm_helical.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_helical.png -------------------------------------------------------------------------------- 
/structure/img/symm_hierarchy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_hierarchy.png -------------------------------------------------------------------------------- /structure/img/symm_internal.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_internal.png -------------------------------------------------------------------------------- /structure/img/symm_local.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_local.png -------------------------------------------------------------------------------- /structure/img/symm_pg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_pg.png -------------------------------------------------------------------------------- /structure/img/symm_pseudo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_pseudo.png -------------------------------------------------------------------------------- /structure/img/symm_subunits.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/structure/img/symm_subunits.png -------------------------------------------------------------------------------- /structure/installation.md: 
-------------------------------------------------------------------------------- 1 | ## Quick Installation 2 | 3 | To start, here is a quick paragraph on how to get access to BioJava. 4 | 5 | BioJava is open source and you can get the code from [GitHub](https://github.com/biojava/biojava), however it might be easier this way: 6 | 7 | BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html) guide. 8 | 9 | As of version 4, BioJava is available in Maven Central. This is all you need to add a BioJava dependency to your projects: 10 | 11 | ```xml 12 | <dependencies> 13 | ... 14 | <dependency> 15 | <!-- This imports a release build of the protein structure module of BioJava. 16 | --> 17 | <groupId>org.biojava</groupId> 18 | <artifactId>biojava-structure</artifactId> 19 | <version>4.2.0</version> 20 | </dependency> 21 | <!-- if you want to use the visualisation tools you need also this one: --> 22 | <dependency> 23 | <groupId>org.biojava</groupId> 24 | <artifactId>biojava-structure-gui</artifactId> 25 | <version>4.2.0</version> 26 | </dependency> 27 | <!-- other biojava jars as needed --> 28 | </dependencies> 29 | ``` 30 | 31 | If you run 32 | 33 | <pre> 34 | mvn package 35 | </pre> 36 | 37 | on your project, the BioJava dependencies will be automatically downloaded and installed for you. 38 | 39 | ### (Optional) Configuration 40 | 41 | BioJava can be configured through several properties: 42 | 43 | | Property | Description | 44 | | --- | --- | 45 | | `PDB_DIR` | Directory for caching structure files from the PDB. Mirrors the PDB's FTP server directory structure, with `PDB_DIR` equivalent to ftp://ftp.wwpdb.org/pub/pdb/. Default: temp directory | 46 | | `PDB_CACHE_DIR` | Cache directory for other files related to the structure package. 
Default: temp directory | 47 | 48 | These can be set either as Java system properties or as environment variables. For example: 49 | 50 | ``` 51 | # This could be added to .bashrc 52 | export PDB_DIR=... 53 | # Or override for a particular execution 54 | java -DPDB_DIR=... -cp ... 55 | ``` 56 | 57 | Note that your IDE may ignore `.bashrc` settings, but should have a preference for passing VM arguments. 58 | 59 | <!--automatically generated footer--> 60 | 61 | --- 62 | 63 | Navigation: 64 | [Home](../README.md) 65 | | [Book 3: The Structure Modules](README.md) 66 | | Chapter 1 : Installation 67 | 68 | Next: [Chapter 2 : First Steps](firststeps.md) 69 | -------------------------------------------------------------------------------- /structure/lists.md: -------------------------------------------------------------------------------- 1 | # Lists of PDB IDs and PDB Status Information 2 | 3 | ## Get a list of all current PDB IDs 4 | 5 | The following code connects to one of the PDB servers and fetches a list of all current PDB IDs. 6 | 7 | ```java 8 | SortedSet<String> currentPDBIds = PDBStatus.getCurrentPDBIds(); 9 | ``` 10 | 11 | ## The current status of a PDB entry 12 | 13 | The following provides information about the status of a PDB entry: 14 | 15 | ```java 16 | Status status = PDBStatus.getStatus("4hhb"); 17 | 18 | // get the current ID for an obsolete entry 19 | String currentID = PDBStatus.getCurrent("1hhb"); 20 | ``` 21 | 22 | 23 | <!--automatically generated footer--> 24 | 25 | --- 26 | 27 | Navigation: 28 | [Home](../README.md) 29 | | [Book 3: The Structure Modules](README.md) 30 | | Chapter 18 : Status Information 31 | 32 | Prev: [Chapter 17 : Special Cases](special.md) 33 | -------------------------------------------------------------------------------- /structure/mmcif.md: -------------------------------------------------------------------------------- 1 | # How to Parse mmCIF Files using BioJava 2 | 3 | A quick tutorial on how to work with mmCIF files. 
4 | 5 | ## What is mmCIF? 6 | 7 | The Protein Data Bank (PDB) has been distributing its archival files as PDB files for a long time. The PDB file format is based on "punchcard"-style rules for storing data in a flat file. With the increasing complexity of macromolecules that are being resolved experimentally, this file format can no longer represent some of the more complex structures. As such, the wwPDB announced the transition from PDB to mmCIF/PDBx as the principal deposition and dissemination file format (see 8 | [here](http://www.wwpdb.org/news/news_2013.html#22-May-2013) and 9 | [here](http://wwpdb.org/workshop/wgroup.html)). 10 | 11 | The mmCIF file format has been around for some time (see [Westbrook 2000][] and [Westbrook 2003][]). [BioJava](http://www.biojava.org) has supported mmCIF for several years. This tutorial is meant to provide a quick introduction to parsing mmCIF files using [BioJava](http://www.biojava.org). 12 | 13 | ## The Basics 14 | 15 | BioJava uses the [CIFTools-java](https://github.com/rcsb/ciftools-java) library to parse mmCIF. BioJava then maps PDB and mmCIF files 16 | onto its own biologically and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](chemcomp.md)). 17 | If you don't want to use that data model, you can still use the CIFTools-java parser; please refer to its documentation. 18 | Let's start with the most basic way of loading a protein structure. 19 | 20 | 21 | ## First Steps 22 | 23 | The simplest way to load a PDBx/mmCIF file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIO.html) class. 
24 | 25 | ```java 26 | Structure structure = StructureIO.getStructure("4HHB"); 27 | // and let's print out how many atoms are in this structure 28 | System.out.println(StructureTools.getNrAtoms(structure)); 29 | ``` 30 | 31 | BioJava automatically downloaded the structure file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copied it into a temporary location. This demonstrates two things: 32 | 33 | + BioJava can automatically download and install files locally 34 | + BioJava by default writes those files into a temporary location (the system temp directory given by the `java.io.tmpdir` property). 35 | 36 | If you already have a local PDB installation, you can configure where BioJava should read the files from by setting the PDB_DIR system property: 37 | 38 | <pre> 39 | -DPDB_DIR=/wherever/you/want/ 40 | </pre> 41 | 42 | ## Switching AtomCache to use different file types 43 | 44 | By default BioJava uses the BCIF file format for parsing data. In order to switch it to use mmCIF, we can take control over 45 | the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) which 46 | manages your PDB ([and btw. also SCOP, CATH](externaldb.md)) installations. 47 | 48 | ```java 49 | AtomCache cache = new AtomCache(); 50 | 51 | cache.setFiletype(StructureFiletype.CIF); 52 | 53 | // if you struggled to set the PDB_DIR property correctly in the previous step, 54 | // you could set it manually like this: 55 | cache.setPath("/tmp/"); 56 | 57 | StructureIO.setAtomCache(cache); 58 | 59 | Structure structure = StructureIO.getStructure("4HHB"); 60 | 61 | // and let's count how many chains are in this structure. 62 | System.out.println(structure.getChains().size()); 63 | ``` 64 | 65 | See other supported file types in the `StructureFiletype` enum. 66 | 67 | ## URL based parsing of files 68 | 69 | StructureIO can also access files via URLs and fetch the data dynamically. E.g. 
the following code shows how to load a file from a remote server. 70 | 71 | ```java 72 | String u = "http://ftp.wwpdb.org/pub/pdb/data/biounit/mmCIF/divided/nw/4nwr-assembly1.cif.gz"; 73 | Structure s = StructureIO.getStructure(u); 74 | System.out.println(s); 75 | ``` 76 | 77 | ### Local URLs 78 | BioJava can also access local files by specifying the URL as 79 | 80 | <pre> 81 | file:///path/to/local/file 82 | </pre> 83 | 84 | 85 | ## Low Level Access 86 | 87 | You can load a BioJava `Structure` object using the ciftools-java parser with: 88 | 89 | ```java 90 | InputStream inStream = new FileInputStream(fileName); 91 | // now get the protein structure. 92 | Structure cifStructure = CifStructureConverter.fromInputStream(inStream); 93 | ``` 94 | 95 | ## I Loaded a Structure Object, What Now? 96 | 97 | BioJava provides a number of algorithms and visualisation tools that you can use to further analyse the structure, or look at it. Here are a couple of suggestions for further reading: 98 | 99 | + [The BioJava Cookbook for protein structures](http://biojava.org/wiki/BioJava:CookBook#Protein_Structure) 100 | + How does BioJava [represent the content](structure-data-model.md) of a PDB/mmCIF file? 101 | + How to calculate a protein structure alignment using BioJava: [tutorial](alignment.md) or [cookbook](http://biojava.org/wiki/BioJava:CookBook:PDB:align) 102 | + [How to work with Groups (AminoAcid, Nucleotide, Hetatom)](http://biojava.org/wiki/BioJava:CookBook:PDB:groups) 103 | 104 | ## Further reading 105 | 106 | See the [http://mmcif.rcsb.org/](http://mmcif.rcsb.org/) site for more documentation on mmCIF. 107 | 108 | 109 | <!-- References --> 110 | 111 | 112 | [Westbrook 2000]: http://www.ncbi.nlm.nih.gov/pubmed/10842738 "Westbrook JD and Bourne PE. STAR/mmCIF: an ontology for macromolecular structure. Bioinformatics 2000 Feb; 16(2) 159-68. pmid:10842738." 113 | 114 | [Westbrook 2003]: http://www.ncbi.nlm.nih.gov/pubmed/12647386 "Westbrook JD and Fitzgerald PM. 
The PDB format, mmCIF, and other data formats. Methods Biochem Anal 2003; 44 161-79. pmid:12647386." 115 | 116 | 117 | <!--automatically generated footer--> 118 | 119 | --- 120 | 121 | Navigation: 122 | [Home](../README.md) 123 | | [Book 3: The Structure Modules](README.md) 124 | | Chapter 6 : Work with mmCIF/PDBx Files 125 | 126 | Prev: [Chapter 5 : Chemical Component Dictionary](chemcomp.md) 127 | 128 | Next: [Chapter 7 : SEQRES and ATOM Records](seqres.md) 129 | -------------------------------------------------------------------------------- /structure/secstruc.md: -------------------------------------------------------------------------------- 1 | Protein Secondary Structure 2 | =========================== 3 | 4 | ## What is Protein Secondary Structure? 5 | 6 | Protein secondary structure (SS) is the general three-dimensional form of local segments of proteins. 7 | Secondary structure can be formally defined by the pattern of hydrogen bonds of the protein 8 | (such as alpha helices and beta sheets) that are observed in an atomic-resolution structure. 9 | 10 | More specifically, the secondary structure is defined by the patterns of hydrogen bonds formed between 11 | amine hydrogen (-NH) and carbonyl oxygen (C=O) atoms contained in the backbone peptide bonds of the protein. 12 | 13 | For more info see the Wikipedia article 14 | on [protein secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure). 15 | 16 | ## Secondary Structure Annotation 17 | 18 | ### Information Sources 19 | 20 | There are various ways to obtain the SS annotation of a protein structure: 21 | 22 | - **Authors assignment**: the authors of the structure describe the SS, usually identifying helices 23 | and beta-sheets, and they assign the corresponding type to each residue involved. The authors assignment 24 | can be found in the `PDB` and `mmCIF` file formats deposited in the PDB, and it can be parsed in **BioJava** 25 | when a `Structure` is loaded. 
26 | 27 | - **Assignment from Atom coordinates**: there exist various programs to assign the SS of a protein. 28 | The algorithms use the atom coordinates of the amino acids to determine hydrogen bonds and geometrical patterns 29 | that define the different types of protein secondary structure. One of the first and most popular algorithms 30 | is `DSSP` (Dictionary of Secondary Structure of Proteins). **BioJava** has an implementation of the algorithm, 31 | written originally in C++, which will be described in the next section. 32 | 33 | - **Prediction from sequence**: Other algorithms use only the amino acid sequence (primary structure) of the protein, 34 | and predict the SS using the SS propensities of each amino acid and multiple alignments with homologous sequences 35 | (e.g. [PSIPRED](http://bioinf.cs.ucl.ac.uk/psipred/)). At the moment **BioJava** does not have an implementation 36 | of this type, which would be more suitable for the sequence and alignment modules. 37 | 38 | ### Secondary Structure Types 39 | 40 | Following the `DSSP` convention, **BioJava** defines 8 types of secondary structure: 41 | 42 | E = extended strand, participates in β ladder 43 | B = residue in isolated β-bridge 44 | H = α-helix 45 | G = 3-helix (3-10 helix) 46 | I = 5-helix (π-helix) 47 | T = hydrogen bonded turn 48 | S = bend 49 | _ = loop (any other type) 50 | 51 | ## Parsing Secondary Structure in BioJava 52 | 53 | Currently there exist two alternatives to parse the secondary structure in **BioJava**: either from the PDB/mmCIF 54 | files of deposited structures (author assignment) or from the output file of a DSSP prediction. Both file types 55 | can be obtained from the PDB servers, if available, so they can be automatically fetched by BioJava. 
56 | 57 | As an example, you can find here the links of the structure **5PTI** to its 58 | [PDB file](http://www.rcsb.org/pdb/files/5PTI.pdb) (search for the HELIX and SHEET lines) and its 59 | [DSSP file](http://www.rcsb.org/pdb/files/5PTI.dssp). 60 | 61 | Note that the DSSP prediction output is more detailed and complete than the authors' assignment. 62 | The choice of one or the other will depend on the use case. 63 | 64 | Below you can find some examples of how to parse and assign the SS of a `Structure`: 65 | 66 | ```java 67 | String pdbID = "5pti"; 68 | FileParsingParameters params = new FileParsingParameters(); 69 | //Only change needed to the normal Structure loading 70 | params.setParseSecStruc(true); //this is false as DEFAULT 71 | 72 | AtomCache cache = new AtomCache(); 73 | cache.setFileParsingParams(params); 74 | 75 | //The loaded Structure contains the SS assigned 76 | Structure s = cache.getStructure(pdbID); 77 | 78 | //If the more detailed DSSP prediction is required call this afterwards 79 | DSSPParser.fetch(pdbID, s, true); //The last argument (true) overrides the previous SS 80 | ``` 81 | 82 | For more examples search in the **demo** package for `DemoLoadSecStruc`. 83 | 84 | ## Assignment of Secondary Structure in BioJava 85 | 86 | ### Algorithm 87 | 88 | The algorithm implemented in BioJava for the assignment of SS is `DSSP`. It is described in the paper by 89 | [Kabsch W. & Sander C. in 1983](http://onlinelibrary.wiley.com/doi/10.1002/bip.360221211/abstract) 90 | [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/6667333). 91 | A brief explanation of the algorithm and the output format can be found 92 | [here](http://swift.cmbi.ru.nl/gv/dssp/DSSP_3.html). 93 | 94 | The interface is very easy: a single method, named *calculate()*, calculates the SS and can assign it to the 95 | input Structure overriding any previous annotation, like in the DSSPParser. 
An example can be found below: 96 | 97 | ```java 98 | String pdbID = "5pti"; 99 | AtomCache cache = new AtomCache(); 100 | 101 | //Load structure without any SS assignment 102 | Structure s = cache.getStructure(pdbID); 103 | 104 | //Predict and assign the SS of the Structure 105 | SecStrucCalc ssp = new SecStrucCalc(); //Instantiation needed 106 | ssp.calculate(s, true); //true assigns the SS to the Structure 107 | ``` 108 | 109 | BioJava Class: 110 | [org.biojava.nbio.structure.secstruc.SecStrucCalc](http://www.biojava.org/docs/api/org/biojava/nbio/structure/secstruc/SecStrucCalc.html) 111 | 112 | ### Storage and Data Structures 113 | 114 | Because there are different sources of SS annotation, the data structure in **BioJava** that stores SS assignments 115 | has two levels. The top level `SecStrucInfo` is very general and only contains two properties: **assignment** 116 | (String describing the source of information) and **type** the SS type. 117 | 118 | However, there is an extended container `SecStrucState`, which is a subclass of `SecStrucInfo`, that stores 119 | all the information of the hydrogen bonding, turns, bends, etc. used for the SS prediction and present in the 120 | DSSP output file format. This information is only used in certain applications, and that is the reason for the 121 | more general `SecStrucInfo` class being used by default. 122 | 123 | In order to access the SS information of a `Structure`, the `SecStrucInfo` object needs to be obtained from the 124 | `Group` properties. 
Below you find an example of how to access and print residue by residue the SS information of 125 | a `Structure`: 126 | 127 | ```java 128 | //This structure should have SS assigned (by any of the methods described) 129 | Structure s; 130 | 131 | for (Chain c : s.getChains()) { 132 | for (Group g: c.getAtomGroups()){ 133 | if (g.hasAminoAtoms()){ //Only AA store SS 134 | //Obtain the object that stores the SS 135 | SecStrucInfo ss = (SecStrucInfo) g.getProperty(Group.SEC_STRUC); 136 | //Print information: chain+resn+name+SS 137 | System.out.println(c.getChainID()+" "+ 138 | g.getResidueNumber()+" "+ 139 | g.getPDBName()+" -> "+ss); 140 | } 141 | } 142 | } 143 | ``` 144 | 145 | ### Output Formats 146 | 147 | Once the SS has been assigned (either loaded or calculated), there are some easy formats to visualize it in **BioJava**: 148 | 149 | - **DSSP format**: the SS can be printed in the DSSP output file format, following the standards so that it can be 150 | parsed again. It is the safest way to serialize a SS annotation and recover it later, but it is probably the most 151 | complicated to visualize. 152 | 153 | <pre> 154 | # RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA 155 | 1 1 A R 0 0 168 0, 0.0 54,-0.1 0, 0.0 5,-0.1 0.000 360.0 360.0 360.0 139.2 32.2 14.7 -11.8 156 | 2 2 A P > - 0 0 45 0, 0.0 3,-1.8 0, 0.0 4,-0.3 -0.194 360.0-122.0 -61.4 144.9 34.9 13.6 -9.4 157 | 3 3 A D G > S+ 0 0 122 1,-0.3 3,-1.6 2,-0.2 4,-0.2 0.790 108.3 71.4 -62.8 -28.5 35.8 10.0 -9.5 158 | 4 4 A F G > S+ 0 0 26 1,-0.3 3,-1.7 2,-0.2 -1,-0.3 0.725 83.7 70.4 -64.1 -23.3 35.0 9.7 -5.9 159 | </pre> 160 | 161 | - **FASTA format**: simple format that prints the SS type of each residue sequentially in the order of the amino acids. 162 | It is the easiest to visualize, but the least informative of all. 
163 | 164 | <pre> 165 | >5PTI_SS-annotation 166 | GGGGS S EEEEEEETTTTEEEEEEE SSS SS BSSHHHHHHHH 167 | </pre> 168 | 169 | - **Helix Summary**: similar to the FASTA format, but also contains information about the helical turns. 170 | 171 | <pre> 172 | 3 turn: >>><<< 173 | 4 turn: >444< >>>>XX<<<< 174 | 5 turn: >5555< 175 | SS: GGGGS S EEEEEEETTTTEEEEEEE SSS SS BSSHHHHHHHH 176 | AA: RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA 177 | </pre> 178 | 179 | - **Secondary Structure Elements**: another way to visualize the SS annotation is by compacting those sequential residues that share the same SS type and assigning an ID to the range. In this way, a structure can be described by 180 | a collection of helices, strands, turns, etc. and each one of the elements can be identified by an ID (e.g. helix 1 (H1), beta-strand 6 (E6), etc). 181 | 182 | <pre> 183 | G1: 3 - 6 184 | S1: 7 - 7 185 | S2: 13 - 13 186 | E1: 18 - 24 187 | T1: 25 - 28 188 | E2: 29 - 35 189 | S3: 37 - 39 190 | S4: 42 - 43 191 | B1: 45 - 45 192 | S5: 46 - 47 193 | H1: 48 - 55 194 | </pre> 195 | 196 | You can find examples of how to get the different file formats in the class `DemoSecStrucPred` in the **demo** 197 | package. 198 | 199 | ### Example 200 | 201 | Use dependencies from Maven: 202 | 203 | ```xml 204 | <dependency> 205 | <groupId>org.biojava</groupId> 206 | <artifactId>biojava-core</artifactId> 207 | <version>4.2.4</version> 208 | </dependency> 209 | <dependency> 210 | <groupId>org.biojava</groupId> 211 | <artifactId>biojava-modfinder</artifactId> 212 | <version>4.2.4</version> 213 | </dependency> 214 | ``` 215 | 216 | This is taken from the DemoLoadSecStruc example in the **demo** package. 
217 | 218 | ```java 219 | 220 | import org.biojava.nbio.structure.Structure; 221 | import org.biojava.nbio.structure.StructureException; 222 | import org.biojava.nbio.structure.align.util.AtomCache; 223 | import org.biojava.nbio.structure.io.FileParsingParameters; 224 | import org.biojava.nbio.structure.secstruc.DSSPParser; 225 | import org.biojava.nbio.structure.secstruc.SecStrucCalc; 226 | import org.biojava.nbio.structure.secstruc.SecStrucInfo; 227 | import org.biojava.nbio.structure.secstruc.SecStrucTools; 228 | 229 | public static void main(String[] args) throws IOException, 230 | StructureException { 231 | 232 | String pdbID = "5pti"; 233 | 234 | // Only change needed to the DEFAULT Structure loading 235 | FileParsingParameters params = new FileParsingParameters(); 236 | params.setParseSecStruc(true); 237 | 238 | AtomCache cache = new AtomCache(); 239 | cache.setFileParsingParams(params); 240 | 241 | // Use PDB format, because SS cannot be parsed from mmCIF yet 242 | cache.setUseMmCif(false); 243 | 244 | // The loaded Structure contains the SS assigned by Author (simple) 245 | Structure s = cache.getStructure(pdbID); 246 | 247 | // Print the Author's assignment (from PDB file) 248 | System.out.println("Author's assignment: "); 249 | printSecStruc(s); 250 | 251 | // If the more detailed DSSP prediction is required call this 252 | DSSPParser.fetch(pdbID, s, true); 253 | 254 | // Print the assignment residue by residue 255 | System.out.println("DSSP assignment: "); 256 | printSecStruc(s); 257 | 258 | // finally use BioJava's built in DSSP-like secondary structure assigner 259 | SecStrucCalc secStrucCalc = new SecStrucCalc(); 260 | 261 | // calculate and assign 262 | secStrucCalc.calculate(s,true); 263 | printSecStruc(s); 264 | 265 | } 266 | 267 | public static void printSecStruc(Structure s){ 268 | List<SecStrucInfo> ssi = SecStrucTools.getSecStrucInfo(s); 269 | for (SecStrucInfo ss : ssi) { 270 | System.out.println(ss.getGroup().getChain().getName() + " " 271 | 
+ ss.getGroup().getResidueNumber() + " " 272 | + ss.getGroup().getPDBName() + " -> " + ss.toString()); 273 | } 274 | } 275 | ``` 276 | 277 | 278 | <!--automatically generated footer--> 279 | 280 | --- 281 | 282 | Navigation: 283 | [Home](../README.md) 284 | | [Book 3: The Structure Modules](README.md) 285 | | Chapter 15 : Protein Secondary Structure 286 | 287 | Prev: [Chapter 14 : Protein Symmetry](symmetry.md) 288 | 289 | Next: [Chapter 17 : Special Cases](special.md) 290 | -------------------------------------------------------------------------------- /structure/seqres.md: -------------------------------------------------------------------------------- 1 | SEQRES and ATOM Records, Mapping to Uniprot (SIFTs) 2 | =================================================== 3 | 4 | How molecular sequences are linked to experimentally observed atoms. 5 | 6 | ## Sequences and Atoms 7 | 8 | In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB often contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments). 9 | 10 | Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of how the regions that have been observed in an experiment and are available in the PDB map to UniProt. 11 | 12 | ![Screenshot of Protein Feature View at RCSB](https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)") 13 | 14 | As you can see, there are three PDB entries (PDB IDs [3LOH](http://www.rcsb.org/pdb/explore.do?structureId=3LOH), [2HR7](http://www.rcsb.org/pdb/explore.do?structureId=2HR7), [3BU3](http://www.rcsb.org/pdb/explore.do?structureId=3BU3)) that cover different regions of the UniProt sequence for the insulin receptor. 
15 | 16 | The blue boxes are regions for which atom records are available. For the grey regions there is sequence information available in the PDB, but no coordinates. 17 | 18 | ## Seqres and Atom Records 19 | 20 | The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequence as can be found in UniProt, since it can contain cloning artefacts and modifications that were necessary in order to crystallize a structure. 21 | 22 | The **Atom** records provide coordinates where it was possible to observe them. 23 | 24 | <pre> 25 | Seqres groups -> sequence that has been used in the experiment 26 | Atom groups -> subset of Seqres groups for which coordinates could be obtained 27 | </pre> 28 | 29 | The *mmCIF/PDBx* file format contains the information on how the Seqres and Atom records are mapped onto each other. However, the *PDB format* does not clearly specify how to resolve this mapping. BioJava contains a utility class that maps the Seqres to the Atom records when parsing PDB files. This class performs an alignment using dynamic programming, which can slow down the parsing process. If you do not require the precise Seqres to Atom mapping, you can turn it off like this: 30 | 31 | ```java 32 | AtomCache cache = new AtomCache(); 33 | 34 | FileParsingParameters params = cache.getFileParsingParams(); 35 | 36 | params.setAlignSeqRes(false); 37 | // load via this cache so that the parameters above are applied 38 | Structure structure = cache.getStructure(...); 39 | 40 | ``` 41 | 42 | ## Accessing Seqres and Atom Groups 43 | 44 | By default BioJava loads both the Seqres and Atom groups into the [Chain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Chain.html) 45 | objects. 46 | 47 | <pre> 48 | Chain -> Seqres groups 49 | -> Atom groups 50 | </pre> 51 | 52 | Groups that are part of the Seqres sequence as well as of the Atom records are mapped onto each other. 
This means you 53 | can iterate over all Seqres groups in a chain and check whether they have observed atoms. 54 | 55 | ## Mapping from Uniprot to Atom Records 56 | 57 | The mapping between PDB and UniProt changes over time, due to the dynamic nature of biological data. The [PDBe](http://www.pdbe.org) has a project that provides up-to-date mappings between the two databases, the [SIFTs](http://www.ebi.ac.uk/pdbe/docs/sifts/) project. 58 | 59 | BioJava contains a parser for the SIFTs XML files. The [SiftsMappingProvider](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/sifts/SiftsMappingProvider.html) acts similarly to the AtomCache class that we [discussed earlier](caching.md), and can automatically download and locally install SIFTs files. 60 | 61 | Here is how to request the mapping for one particular PDB ID: 62 | 63 | ```java 64 | List<SiftsEntity> entities = SiftsMappingProvider.getSiftsMapping("1gc1"); 65 | 66 | for (SiftsEntity e : entities){ 67 | System.out.println(e.getEntityId() + " " +e.getType()); 68 | 69 | for ( SiftsSegment seg: e.getSegments()) { 70 | System.out.println(" Segment: " + seg.getSegId() + " " + seg.getStart() + " " + seg.getEnd()) ; 71 | 72 | for ( SiftsResidue res: seg.getResidues() ) { 73 | System.out.println(" " + res); 74 | } 75 | } 76 | 77 | } 78 | ``` 79 | 80 | This gives the following output: 81 | 82 | <pre> 83 | C protein 84 | Segment: 1gc1_C_1_181 1 181 85 | SiftsResidue [pdbResNum=1, pdbResName=LYS, chainId=C, uniProtResName=K, uniProtPos=26, naturalPos=1, seqResName=LYS, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false] 86 | SiftsResidue [pdbResNum=2, pdbResName=LYS, chainId=C, uniProtResName=K, uniProtPos=27, naturalPos=2, seqResName=LYS, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false] 87 | SiftsResidue [pdbResNum=3, pdbResName=VAL, chainId=C, uniProtResName=V, uniProtPos=28, naturalPos=3, seqResName=VAL, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false] 88 | SiftsResidue [pdbResNum=4, 
pdbResName=VAL, chainId=C, uniProtResName=V, uniProtPos=29, naturalPos=4, seqResName=VAL, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false] 89 | SiftsResidue [pdbResNum=5, pdbResName=LEU, chainId=C, uniProtResName=L, uniProtPos=30, naturalPos=5, seqResName=LEU, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false] 90 | SiftsResidue [pdbResNum=6, pdbResName=GLY, chainId=C, uniProtResName=G, uniProtPos=31, naturalPos=6, seqResName=GLY, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false] 91 | SiftsResidue [pdbResNum=7, pdbResName=LYS, chainId=C, uniProtResName=K, uniProtPos=32, naturalPos=7, seqResName=LYS, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false] 92 | ... 93 | </pre> 94 | 95 | As you can see for each residue in the Uniprot / PDB sequence the matching counterpart is provided (if there is one). 96 | 97 | 98 | 99 | <!--automatically generated footer--> 100 | 101 | --- 102 | 103 | Navigation: 104 | [Home](../README.md) 105 | | [Book 3: The Structure Modules](README.md) 106 | | Chapter 7 : SEQRES and ATOM Records 107 | 108 | Prev: [Chapter 6 : Work with mmCIF/PDBx Files](mmcif.md) 109 | 110 | Next: [Chapter 8 : Structure Alignments](alignment.md) 111 | -------------------------------------------------------------------------------- /structure/special.md: -------------------------------------------------------------------------------- 1 | # Special Cases When Working with Protein Structures 2 | 3 | ## Alternate Locations 4 | 5 | Some PDB entries contain alternate conformations for parts of a structure or a group. BioJava merges alternate conformations into a single group, for which alternative groups are available. 
6 | 7 | ```java 8 | 9 | Structure s = StructureIO.getStructure("1AAC"); 10 | 11 | Chain a = s.getChainByPDB("A"); 12 | 13 | Group g = a.getGroupByPDB( ResidueNumber.fromString("27")); 14 | 15 | System.out.println(g); 16 | for (Atom atom : g.getAtoms()) { 17 | System.out.print(atom.toPDB()); 18 | } 19 | 20 | 21 | int pos = 0; 22 | for (Group alt: g.getAltLocs()) { 23 | pos++; 24 | System.out.println("altLoc: " + pos + " " + alt); 25 | for (Atom atom : alt.getAtoms()) { 26 | System.out.print(atom.toPDB()); 27 | } 28 | } 29 | ``` 30 | 31 | ## Insertion Codes 32 | 33 | Insertion codes were introduced in the PDB when people wanted to compare the "same" protein between different species. As it turned out, the "same" protein did not show exactly the same sequence in different species, and in some cases insertions were found, resulting in longer sequences. For the comparison of the proteins it was considered important to preserve the numbering, so that one could say, for example, that "HIS 75" is an important residue. To make up for the mismatch in the lengths of the sequences, insertion codes were introduced. As a consequence, in the PDB, a particular residue is identified uniquely by three data items: chain identifier, residue number, and insertion code. 34 | 35 | BioJava contains the ResidueNumber object to help with characterizing each group in a file. PDB ID 1IGY contains some extra residues around chain B position 82. BioJava can represent these like this: 36 | 37 | ```java 38 | Structure s1 = StructureIO.getStructure("1IGY"); 39 | 40 | Chain b = s1.getChainByPDB("B"); 41 | 42 | for (Group g : b.getAtomGroups()){ 43 | System.out.println(g.getResidueNumber() + " " + g.getPDBName() + " " + g.getResidueNumber().getInsCode()); 44 | } 45 | 46 | ``` 47 | 48 | This will display the following table (residue number, name, insertion code): 49 | 50 | ``` 51 | ... 
52 | 81 HIS null 53 | 82 LEU null 54 | 82A SER A 55 | 82B SER B 56 | 82C LEU C 57 | 83 THR null 58 | 84 SER null 59 | ... 60 | ``` 61 | 62 | 63 | ## Chromophores 64 | 65 | A [chromophore](http://en.wikipedia.org/wiki/Chromophore) is the part of a molecule responsible for its color. Some proteins, such as GFP, contain a chromophore that consists of three modified residues. BioJava represents it as a single group in terms of atoms, but as three amino acids when creating the amino acid sequence. 66 | 67 | ```java 68 | 69 | 70 | // make sure we download chemical component definitions 71 | // which is required for correctly representing the chromophore 72 | FileParsingParameters params = new FileParsingParameters(); 73 | params.setLoadChemCompInfo(true); 74 | 75 | // now register the parameters in the cache 76 | AtomCache cache = new AtomCache(); 77 | cache.setFileParsingParams(params); 78 | StructureIO.setAtomCache(cache); 79 | 80 | 81 | // request a GFP protein 82 | Structure s1 = StructureIO.getStructure("2pxw"); 83 | 84 | // and print out the internals 85 | System.out.println(s1.getPDBHeader().toPDB()); 86 | 87 | // chromophore is at PDB residue number 66 88 | for ( Chain c : s1.getChains()) { 89 | 90 | System.out.println("Chain " + c.getChainID() + 91 | " internal " + c.getInternalChainID() + 92 | " ligands " + c.getAtomLigands().size()); 93 | System.out.println(" 10 20 30 40 50 60"); 94 | System.out.println("1234567890123456789012345678901234567890123456789012345678901234567890"); 95 | System.out.println(c.getAtomSequence()); 96 | 97 | int pos = 0 ; 98 | for (Group g: c.getAtomGroups()) { 99 | pos++; 100 | System.out.println(pos + " " + g.getResidueNumber() + " " + g.getPDBName() + " " + g.getType() + " " + g.getChemComp().getOne_letter_code() + " " + g.getChemComp().getType() ); 101 | } 102 | } 103 | ``` 104 | 105 | This produces the following output; note 'DYG' at position 63. 106 | 107 | ``` 108 | 60 109 | ...01234567890 110 | ...AAFDYGNRVFTEY... 
111 | ``` 112 | 113 | DYG is an unusual group: `getOne_letter_code()` returns three characters for it. 114 | 115 | ``` 116 | ... 117 | 62 65 PHE amino F L-PEPTIDE LINKING 118 | 63 66 DYG amino DYG L-PEPTIDE LINKING 119 | 64 69 ASN amino N L-PEPTIDE LINKING 120 | ... 121 | ``` 122 | 123 | ## Microheterogeneity 124 | 125 | 126 | 127 | <!--automatically generated footer--> 128 | 129 | --- 130 | 131 | Navigation: 132 | [Home](../README.md) 133 | | [Book 3: The Structure Modules](README.md) 134 | | Chapter 17 : Special Cases 135 | 136 | Prev: [Chapter 15 : Protein Secondary Structure](secstruc.md) 137 | 138 | Next: [Chapter 18 : Status Information](lists.md) 139 | -------------------------------------------------------------------------------- /structure/structure-data-model.md: -------------------------------------------------------------------------------- 1 | # The BioJava-Structure Data Model 2 | 3 | A biologically and chemically meaningful data representation of PDB/mmCIF. 4 | 5 | ## The Basics 6 | 7 | BioJava at its core is a collection of file parsers and (in some cases) data models to represent frequently used biological data. The protein-structure modules represent macromolecular data in a way that should make it easy to work with. The representation is essentially independent of the underlying file format, so the user can choose to work with either PDB or mmCIF files and still get an almost identical data representation. (There can be subtle differences between PDB and mmCIF data; for example, the atom indices in a few entries are not 100% identical.) 8 | 9 | ## The Main Hierarchy 10 | 11 | BioJava provides a flexible data structure for managing protein structural data. The 12 | [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html) class is the main container. 
13 | 14 | A `Structure` has a hierarchy of sub-objects: 15 | 16 | <pre> 17 | Structure 18 | | 19 | Model(s) 20 | | 21 | Chain(s) 22 | | 23 | Group(s) -> Chemical Component Definition 24 | | 25 | Atom(s) 26 | </pre> 27 | 28 | All `Structure` objects contain one or more `Models`. This means that X-ray structures also contain a "virtual" model, which serves as a container for the chains. This makes it possible to represent multi-model X-ray structures, e.g. from time-series analysis. The most common way to access chains is via: 29 | 30 | ```java 31 | List<Chain> chains = structure.getChains(); 32 | ``` 33 | 34 | This works for both NMR and X-ray based structures; by default the first `Model` is accessed. 35 | 36 | ## Working with Atoms 37 | 38 | There are several ways to access the data contained in a [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html). 39 | If you want to directly access an array of representative [Atoms](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Atom.html) (CA for proteins, P for nucleotides), you can use the utility class called [StructureTools](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureTools.html): 40 | 41 | ```java 42 | // get all representative atoms in the structure, one per residue 43 | Atom[] caAtoms = StructureTools.getRepresentativeAtomArray(structure); 44 | ``` 45 | 46 | Alternatively, you can also access atoms through their parent group. 
47 | 48 | ## Loop over All the Data 49 | 50 | Here is an example that loops over the whole data model and prints out the HEM groups of hemoglobin: 51 | 52 | ```java 53 | Structure structure = StructureIO.getStructure("4hhb"); 54 | 55 | List<Chain> chains = structure.getChains(); 56 | 57 | System.out.println(" # chains: " + chains.size()); 58 | 59 | for (Chain c : chains) { 60 | 61 | System.out.println(" Chain: " + c.getId() + " # groups with atoms: " + c.getAtomGroups().size()); 62 | 63 | for (Group g: c.getAtomGroups()){ 64 | 65 | if ( g.getPDBName().equalsIgnoreCase("HEM")) { 66 | 67 | System.out.println(" " + g); 68 | 69 | for (Atom a: g.getAtoms()) { 70 | 71 | System.out.println(" " + a); 72 | 73 | } 74 | } 75 | } 76 | } 77 | ``` 78 | 79 | ## Working with Groups 80 | 81 | The [Group](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Group.html) interface defines all methods common to a group of atoms. There are 3 types of Groups: 82 | 83 | * [AminoAcid](http://www.biojava.org/docs/api4.2.1/org/biojava/nbio/structure/AminoAcid.html) 84 | * [Nucleotide](http://www.biojava.org/docs/api4.2.1/org/biojava/nbio/structure/NucleotideImpl.html) 85 | * [Hetatom](http://www.biojava.org/docs/api4.2.1/org/biojava/nbio/structure/HetatomImpl.html) 86 | 87 | In order to get all amino acids that have been observed in a PDB chain, you can use the following utility method: 88 | 89 | ```java 90 | Chain chain = structure.getPolyChainByPDB("A"); 91 | List<Group> groups = chain.getAtomGroups(GroupType.AMINOACID); 92 | for (Group group : groups) { 93 | SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC); 94 | 95 | // print the secondary structure assignment 96 | System.out.println(group + " -- " + secStrucInfo); 97 | } 98 | ``` 99 | 100 | In a similar way, you can access all nucleotide groups with: 101 | ```java 102 | chain.getAtomGroups(GroupType.NUCLEOTIDE); 103 | ``` 104 | 105 | The Hetatom groups are accessed in a similar fashion: 106 | ```java 107 | 
chain.getAtomGroups(GroupType.HETATM); 108 | ``` 109 | 110 | 111 | Since all 3 types of groups implement the Group interface, you can also iterate over all groups and check the type of each one: 112 | 113 | ```java 114 | List<Group> allgroups = chain.getAtomGroups(); 115 | for (Group group : allgroups) { 116 | if (group.isAminoAcid()) { 117 | SecStrucInfo secStrucInfo = (SecStrucInfo) group.getProperty(Group.SEC_STRUC); 118 | System.out.println(group + " -- " + secStrucInfo); 119 | } 120 | } 121 | ``` 122 | 123 | ## A Note 124 | 125 | The detection of the groups works really well in connection with the [Chemical Component Dictionary](chemcomp.md), as we will discuss in the next section. Without this dictionary there can be inconsistencies, in particular with chemically modified residues. 126 | 127 | ## Entities and Chains 128 | 129 | Entities are the distinct chemical components of structures in the PDB. 130 | Unlike chains, entities do not include duplicate copies and each entity is different from every other 131 | entity in the structure. There are different types of entities. Polymer entities include Protein, DNA, 132 | and RNA. Ligands are smaller chemical components that are not part of a polymer entity. 133 | 134 | <pre> 135 | Structure -> Entity -> Chain 136 | </pre> 137 | 138 | To explain this with an example, hemoglobin (e.g. PDB ID 4HHB) has two components, alpha 139 | and beta. Each of the entities has two copies (= chains) in the structure. In 4HHB, alpha 140 | has the two chains with the IDs A and C, and beta has the chains B and D. In total, hemoglobin is 141 | built up out of four chains. 
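The relation between chains and entities described above can be sketched in plain Java, without BioJava. This is only a simplified illustration: it assumes that chains belong to the same entity exactly when their sequences are identical, and the class name `EntityGroupingSketch`, the method `groupBySequence`, and the placeholder sequence strings are hypothetical (they are not real hemoglobin sequences and not part of the BioJava API).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of the chain-to-entity relation; not BioJava code. */
final class EntityGroupingSketch {

    /**
     * Groups chain IDs by sequence. Each distinct sequence stands for one
     * entity, and the chains sharing it are the copies of that entity.
     * Assumption: identical sequence means same entity (a simplification).
     */
    static Map<String, List<String>> groupBySequence(Map<String, String> chainSequences) {
        Map<String, List<String>> entities = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : chainSequences.entrySet()) {
            entities.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return entities;
    }

    public static void main(String[] args) {
        // Chain ID -> placeholder sequence label, mimicking the 4HHB layout
        Map<String, String> chains = new LinkedHashMap<>();
        chains.put("A", "ALPHA");
        chains.put("B", "BETA");
        chains.put("C", "ALPHA");
        chains.put("D", "BETA");

        // One line per entity, listing the chains that are copies of it
        groupBySequence(chains).forEach((sequence, ids) ->
                System.out.println("entity " + sequence + " -> chains " + ids));
    }
}
```

With the four placeholder chains, the grouping yields two entities with two chains each, which mirrors the alpha (A, C) and beta (B, D) arrangement of hemoglobin.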
142 | 143 | The following BioJava code prints all the entities in a structure: 144 | ```java 145 | Structure structure = StructureIO.getStructure("4hhb"); 146 | 147 | System.out.println(structure); 148 | 149 | System.out.println(" # of compounds (entities) " + structure.getEntityInfos().size()); 150 | 151 | for ( EntityInfo entity: structure.getEntityInfos()) { 152 | System.out.println(" " + entity); 153 | } 154 | ``` 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | <!--automatically generated footer--> 163 | 164 | --- 165 | 166 | Navigation: 167 | [Home](../README.md) 168 | | [Book 3: The Structure Modules](README.md) 169 | | Chapter 3 : Structure Data Model 170 | 171 | Prev: [Chapter 2 : First Steps](firststeps.md) 172 | 173 | Next: [Chapter 4 : Local Installations](caching.md) 174 | -------------------------------------------------------------------------------- /structure/symmetry.md: -------------------------------------------------------------------------------- 1 | Protein Symmetry using BioJava 2 | ================================================================ 3 | 4 | BioJava can be used to detect, analyze, and visualize **symmetry** and 5 | **pseudo-symmetry** in the **quaternary** (biological assembly) and tertiary 6 | (**internal**) structural levels of proteins. 7 | 8 | ## Quaternary Symmetry 9 | 10 | The **quaternary symmetry** of a structure defines the relation and arrangement of the individual chains or groups of chains that are part of a biological assembly. 11 | For a more exhaustive explanation of protein quaternary symmetry and its different types, visit the [PDB help page](http://www.rcsb.org/pdb/staticHelp.do?p=help/viewers/jmol_symmetry_view.html). 12 | 13 | In the **quaternary symmetry** detection problem, we are given as input a set of chains (subunits) that are part of a biological assembly, defined by their atomic coordinates, and we are required to find as output the highest overall symmetry group that 14 | relates them. 
15 | The solution is divided into the following steps: 16 | 17 | 1. First, we need to identify the chains that are identical (or similar 18 | in the pseudo-symmetry case). For that purpose, we perform a pairwise alignment of all 19 | chains and identify **clusters of identical or similar subunits**. 20 | 2. Next, we reduce each of the polypeptide chains to a single point, their **centroid** (center of mass). 21 | 3. Afterwards, we try different **symmetry operations** using a grid search to superimpose the chain centroids 22 | and score them using the RMSD. 23 | 4. Finally, based on the parameters (cutoffs), we determine the **overall symmetry** of the 24 | structure, with the symmetry relations obtained in the previous step. 25 | 5. In case of asymmetric structure, we discard combinatorially a number of chains and try 26 | to detect any **local symmetries** present (symmetry that does not involve all subunits of the biological assembly). 27 | 28 | The **quaternary symmetry** detection algorithm is implemented in the biojava class 29 | [QuatSymmetryDetector](http://www.biojava.org/docs/api/org/biojava/nbio/structure/symmetry/core/QuatSymmetryDetector). 
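Steps 2 and 3 of the list above can be illustrated with a small plain-Java sketch. It is not the BioJava implementation: the class `CentroidSymmetrySketch` and its methods are hypothetical, the centroid is an unweighted mean rather than a true center of mass, only rotations about the z axis are tried, and each rotated centroid is simply matched to its nearest original centroid before computing the RMSD.

```java
import java.util.List;

/** Toy sketch of centroid reduction and RMSD scoring; not BioJava code. */
final class CentroidSymmetrySketch {

    /** Geometric center (unweighted mean) of a list of xyz coordinates. */
    static double[] centroid(List<double[]> atoms) {
        double[] c = new double[3];
        for (double[] a : atoms) {
            for (int i = 0; i < 3; i++) {
                c[i] += a[i];
            }
        }
        for (int i = 0; i < 3; i++) {
            c[i] /= atoms.size();
        }
        return c;
    }

    /** Rotates a point about the z axis by the given angle in radians. */
    static double[] rotateZ(double[] p, double angle) {
        double cos = Math.cos(angle);
        double sin = Math.sin(angle);
        return new double[] {cos * p[0] - sin * p[1], sin * p[0] + cos * p[1], p[2]};
    }

    /**
     * Applies the trial rotation to every centroid, matches each rotated
     * centroid to the nearest original one, and returns the RMSD of those
     * distances. A value near zero means the rotation maps the set onto itself.
     */
    static double rmsdAfterRotation(List<double[]> centroids, double angle) {
        double sum = 0;
        for (double[] c : centroids) {
            double[] r = rotateZ(c, angle);
            double best = Double.MAX_VALUE;
            for (double[] o : centroids) {
                double dx = r[0] - o[0];
                double dy = r[1] - o[1];
                double dz = r[2] - o[2];
                best = Math.min(best, dx * dx + dy * dy + dz * dz);
            }
            sum += best;
        }
        return Math.sqrt(sum / centroids.size());
    }

    public static void main(String[] args) {
        // Three hypothetical chain centroids placed 120 degrees apart
        double third = 2 * Math.PI / 3;
        List<double[]> centroids = List.of(
                new double[] {1, 0, 0},
                new double[] {Math.cos(third), Math.sin(third), 0},
                new double[] {Math.cos(2 * third), Math.sin(2 * third), 0});

        // Small grid search over candidate n-fold rotations about z
        for (int n = 2; n <= 6; n++) {
            double score = rmsdAfterRotation(centroids, 2 * Math.PI / n);
            System.out.printf("C%d trial: RMSD = %.3f%n", n, score);
        }
    }
}
```

In this toy setup only the threefold rotation brings the RMSD close to zero, which is the kind of signal a cutoff on the score (step 4) would then turn into a symmetry assignment.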
30 | An example of how to use `QuatSymmetryDetector` programmatically is shown below: 31 | 32 | ```java 33 | // First download the structure in the biological assembly form (here: assembly 1 of hemoglobin) 34 | Structure s = StructureIO.getBiologicalAssembly("4hhb", 1); 35 | 36 | // Set some parameters if needed different than DEFAULT - see descriptions 37 | QuatSymmetryParameters parameters = new QuatSymmetryParameters(); 38 | SubunitClustererParameters clusterParams = new SubunitClustererParameters(); 39 | 40 | // QuatSymmetryDetector is used through its static methods (BioJava 5), 41 | // so no detector instance is created 42 | 43 | // Static methods in QuatSymmetryDetector perform the calculation 44 | QuatSymmetryResults globalResults = QuatSymmetryDetector.calcGlobalSymmetry(s, parameters, clusterParams); 45 | List<QuatSymmetryResults> localResults = QuatSymmetryDetector.calcLocalSymmetries(s, parameters, clusterParams); 46 | 47 | ``` 48 | See also the [demo](https://github.com/biojava/biojava/blob/885600670be75b7f6bc5216bff52a93f43fff09e/biojava-structure/src/main/java/demo/DemoSymmetry.java#L37-L59) provided in **BioJava** for a real case working example. 49 | 50 | The returned `QuatSymmetryResults` object contains all the information of the subunit clustering and structural symmetry. 51 | This object will be used later to obtain axes of symmetry, point group name, stoichiometry or even display the results in Jmol. 52 | 53 | In case of an asymmetrical structure, the result is a C1 point group. 54 | The return type of the local symmetry is a `List` because there can be multiple valid options of local symmetry. 55 | The list will be empty if no local symmetries exist in the structure. 56 | 57 | 58 | ### Global Symmetry 59 | 60 | In the **global symmetry** mode all chains have to be part of the symmetry result. 
61 | 62 | #### Point Group 63 | 64 | In a **point group** one or more rotation axes define the overall symmetry 65 | operations, with the property that all the axes intersect at a single point. 66 | 67 | ![PDB ID 1VYM](img/symm_pg.png) 68 | 69 | #### Helical 70 | 71 | In **helical** symmetry there is a single axis with rotation and translation 72 | components. 73 | 74 | ![PDB ID 4UDV](img/symm_helical.png) 75 | 76 | ### Local Symmetry 77 | 78 | In **local symmetry** a number of chains is left out, so that the symmetry only applies to a subset of chains. 79 | 80 | ![PDB ID 4F88](img/symm_local.png) 81 | 82 | ### Pseudo-Symmetry 83 | 84 | In **pseudo-symmetry** the chains related by the symmetry are not completely 85 | identical, but they share a sequence or structural similarity above the pseudo-symmetry 86 | similarity threshold. 87 | 88 | If we consider hemoglobin, at a 95% sequence identity threshold the alpha and 89 | beta subunits are considered different, which corresponds to an A2B2 stoichiometry 90 | and a C2 point group. At the structural similarity level, all four chains are 91 | considered homologous (~45% sequence identity) with an A4 pseudostoichiometry and 92 | D2 pseudosymmetry. 93 | 94 | ![PDB ID 4HHB](img/symm_pseudo.png) 95 | 96 | ## Internal Symmetry 97 | 98 | **Internal symmetry** refers to the symmetry present in a single chain, that is, 99 | the tertiary structure. The algorithm implemented in BioJava to detect internal 100 | symmetry is called **CE-Symm**. 101 | 102 | ### CE-Symm 103 | 104 | The **CE-Symm** algorithm was originally developed by [Myers-Turnbull D., Bliven SE., 105 | Rose PW., Aziz ZK., Youkharibache P., Bourne PE. & Prlić A. in 106 | 2014](http://www.sciencedirect.com/science/article/pii/S0022283614001557) [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/24681267). 
107 | As the name of the algorithm suggests, **CE-Symm** uses the Combinatorial 108 | Extension (**CE**) algorithm to generate an alignment of the structure chain to itself, 109 | disabling the identity alignment (the diagonal of the **DotPlot** representation of a 110 | structure alignment). This allows the identification of alternative self-alignments, 111 | which are related to symmetry and/or structural repeats inside the chain. 112 | 113 | By a procedure called **refinement**, the subunits of the chain that are part of the symmetry 114 | are defined and a **multiple alignment** is created. This process can be thought of as 115 | dividing the chain into subchains and then superimposing the subchains onto each other to 116 | create a multiple alignment of the subunits, respecting the symmetry axes. 117 | 118 | The **internal symmetry** detection algorithm is implemented in the BioJava class 119 | [CeSymm](http://www.biojava.org/docs/api/org/biojava/nbio/structure/symmetry/internal/CeSymm). 120 | It returns a `MultipleAlignment` object (see the explanation of the model in [Data Models](alignment-data-model.md)) 121 | that describes the similarity of the internal repeats. If no symmetry is detected, the 122 | returned alignment represents the optimal self-alignment produced by the first step of the **CE-Symm** 123 | algorithm. 
124 | 125 | ```java 126 | //Input the atoms in a chain as an array 127 | Atom[] atoms = StructureTools.getRepresentativeAtomArray(chain); 128 | 129 | //Initialize the algorithm 130 | CeSymm ceSymm = new CeSymm(); 131 | 132 | //Choose some parameters 133 | CESymmParameters params = ceSymm.getParameters(); 134 | params.setRefineMethod(RefineMethod.SINGLE); 135 | params.setOptimization(true); 136 | params.setMultipleAxes(true); 137 | 138 | //Run the symmetry analysis - alignment as an output 139 | MultipleAlignment symmetry = ceSymm.analyze(atoms, params); 140 | 141 | //Test if the alignment returned was refined with 142 | boolean refined = SymmetryTools.isRefined(symmetry); 143 | 144 | //Get the axes of symmetry from the aligner 145 | SymmetryAxes axes = ceSymm.getSymmetryAxes(); 146 | 147 | //Display the results in jmol with the SymmetryDisplay 148 | SymmetryDisplay.display(symmetry, axes); 149 | 150 | //Show the point group, if any of the internal symmetry 151 | QuatSymmetryResults pg = SymmetryTools.getQuaternarySymmetry(symmetry); 152 | System.out.println(pg.getSymmetry()); 153 | 154 | ``` 155 | 156 | To enable some extra features in the display, a `SymmetryDisplay` 157 | class has been created, although the `MultipleAlignmentDisplay` method 158 | can also be used for that purpose (it will not show symmetry axes or 159 | symmetry menus). 160 | 161 | Lastly, the `SymmetryGUI` class in the **structure-gui** package 162 | provides a GUI to trigger internal symmetry analysis, equivalent 163 | to the GUI to trigger structure alignments. 164 | 165 | ### Symmetry Display 166 | 167 | The symmetry display is similar to the **quaternary symmetry**, because 168 | part of the code is shared. 
See for example this beta-propeller (1U6D), 169 | where the repeated beta-sheets are connected by a linker forming a C6 170 | point group internal symmetry: 171 | 172 | ![PDB ID 1U6D](img/symm_internal.png) 173 | 174 | #### Hierarchical Symmetry 175 | 176 | One additional feature of the **internal symmetry** display is the representation 177 | of hierarchical symmetries and repeats. Contrary to point groups, some structures 178 | have different **levels** of symmetry. That is, the whole structure has, e.g., C2 179 | symmetry and, at the same time, each of the two parts has C2 symmetry, but the axes 180 | of both levels are not related by a point group (i.e. they do not cross at a single 181 | point). 182 | 183 | A very clear example is the beta-gamma-crystallins, like 4GCR: 184 | 185 | ![PDB ID 4GCR](img/symm_hierarchy.png) 186 | 187 | #### Subunit Multiple Alignment 188 | 189 | Another feature of the display is the option to show the **multiple alignment** of 190 | the symmetry related subunits created during the **refinement** process. Search for 191 | the option *Subunit Superposition* in the *symmetry* menu of the Jmol window. For 192 | the previous example the display looks like this: 193 | 194 | ![PDB ID 4GCR](img/symm_subunits.png) 195 | 196 | The subunit display highlights the differences and similarities between the symmetry 197 | related subunits of the chain, and helps the user to identify conserved and divergent 198 | regions, with the help of the *Sequence Alignment Panel*. 199 | 200 | ## Quaternary + Internal Overall Symmetry 201 | 202 | Finally, the internal and quaternary symmetries can be merged to obtain the 203 | overall combined symmetry. As we have seen before, the protein 1VYM is a DNA-clamp that 204 | has three chains arranged in a C3 symmetry. 205 | Each chain is internally fourfold symmetric with two levels of symmetry. 
We can analyze the overall symmetry of the structure by considering together the C3 quaternary symmetry and the fourfold internal symmetry. 206 | In this case, the internal symmetry **augments** the point group of the quaternary symmetry to a D6 overall symmetry, as we can see in the figure below: 207 | 208 | ![PDB ID 1VYM](img/symm_combined.png) 209 | 210 | An example of how to enable the **combined symmetry** (quaternary + internal symmetries) programmatically is shown below: 211 | 212 | ```java 213 | // First download the structure in the biological assembly form (here: assembly 1 of 1VYM) 214 | Structure s = StructureIO.getBiologicalAssembly("1vym", 1); 215 | 216 | // Initialize default parameters 217 | QuatSymmetryParameters parameters = new QuatSymmetryParameters(); 218 | SubunitClustererParameters clusterParams = new SubunitClustererParameters(); 219 | 220 | // In SubunitClustererParameters set the clustering method to STRUCTURE and the internal symmetry option to true 221 | clusterParams.setClustererMethod(SubunitClustererMethod.STRUCTURE); 222 | clusterParams.setInternalSymmetry(true); 223 | 224 | // You can lower the default structural coverage to improve the recall 225 | clusterParams.setStructureCoverageThreshold(0.75); 226 | 227 | // QuatSymmetryDetector is used through its static methods (BioJava 5), 228 | // so no detector instance is created 229 | 230 | // Static methods in QuatSymmetryDetector perform the calculation 231 | QuatSymmetryResults overallResults = QuatSymmetryDetector.calcGlobalSymmetry(s, parameters, clusterParams); 232 | 233 | ``` 234 | 235 | See also the [test](https://github.com/biocryst/biojava/blob/df22da37a86a0dba3fb35bee7e17300d402ab469/biojava-integrationtest/src/test/java/org/biojava/nbio/structure/test/symmetry/TestQuatSymmetryDetectorExamples.java#L167-L192) provided in **BioJava** for a real case working example. 
236 | 237 | 238 | ## Please Cite 239 | 240 | **Analyzing the symmetrical arrangement of structural repeats in proteins with CE-Symm**<br/> 241 | *Spencer E Bliven, Aleix Lafita, Peter W Rose, Guido Capitani, Andreas Prlić, & Philip E Bourne* <br/> 242 | [PLOS Computational Biology (2019) 15 (4):e1006842.](https://journals.plos.org/ploscompbiol/article/citation?id=10.1371/journal.pcbi.1006842) <br/> 243 | [![doi](https://img.shields.io/badge/doi-10.1371%2Fjournal.pcbi.1006842-blue.svg?style=flat)](https://doi.org/10.1371/journal.pcbi.1006842) [![pubmed](https://img.shields.io/badge/pubmed-31009453-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/31009453) 244 | 245 | 246 | 247 | <!--automatically generated footer--> 248 | 249 | --- 250 | 251 | Navigation: 252 | [Home](../README.md) 253 | | [Book 3: The Structure Modules](README.md) 254 | | Chapter 14 : Protein Symmetry 255 | 256 | Prev: [Chapter 13 - Finding all Interfaces in Crystal: Crystal Contacts](crystal-contacts.md) 257 | 258 | Next: [Chapter 15 : Protein Secondary Structure](secstruc.md) 259 | --------------------------------------------------------------------------------