├── .gitignore
├── README.md
├── alignment
│   ├── README.md
│   ├── img
│   │   └── alignment.png
│   ├── installation.md
│   └── smithwaterman.md
├── bin
│   └── update_index.py
├── core
│   ├── README.md
│   ├── img
│   │   └── core.png
│   ├── installation.md
│   ├── readwrite.md
│   ├── sequences.md
│   └── translating.md
├── genomics
│   ├── README.md
│   ├── chromosomeposition.md
│   ├── genebank.md
│   ├── genenames.md
│   ├── gff.md
│   ├── img
│   │   └── genomics.png
│   ├── installation.md
│   ├── karyotype.md
│   └── twobit.md
├── installation.md
├── license.md
├── logo.png
├── modfinder
│   ├── README.md
│   ├── add-protein-modification.md
│   ├── identify-protein-modifications.md
│   ├── installation.md
│   └── supported-protein-modifications.md
├── protein-disorder
│   └── README.md
└── structure
    ├── README.md
    ├── alignment-data-model.md
    ├── alignment.md
    ├── asa.md
    ├── bioassembly.md
    ├── caching.md
    ├── chemcomp.md
    ├── contact-map.md
    ├── crystal-contacts.md
    ├── externaldb.md
    ├── firststeps.md
    ├── img
    │   ├── 143px-Selenomethionine-from-xtal-3D-balls.png
    │   ├── 1cfd_1cll_fatcat.png
    │   ├── 1cfd_1cll_fatcat.xcf
    │   ├── 1cfd_1cll_flexible.png
    │   ├── 1cfd_1cll_rigid.png
    │   ├── 1dan_scop.png
    │   ├── 1gav_asym.png
    │   ├── 1gav_biounit.png
    │   ├── 1hho_asym.png
    │   ├── 1hho_biounit.png
    │   ├── 1m4x_bio_r_250.jpg
    │   ├── 2hyn_1zll.png
    │   ├── 3cna.A_2pel.A_cecp.png
    │   ├── 4hhb_bio_r_250.jpg
    │   ├── 4hhb_jmol.png
    │   ├── alignment_gui.png
    │   ├── alignmentpanel.png
    │   ├── cath_1dan.png
    │   ├── database_search.png
    │   ├── database_search_results.png
    │   ├── multiple_gui.png
    │   ├── multiple_jmol_globins.png
    │   ├── multiple_panel_globins.png
    │   ├── symm_combined.png
    │   ├── symm_helical.png
    │   ├── symm_hierarchy.png
    │   ├── symm_internal.png
    │   ├── symm_local.png
    │   ├── symm_pg.png
    │   ├── symm_pseudo.png
    │   └── symm_subunits.png
    ├── installation.md
    ├── lists.md
    ├── mmcif.md
    ├── secstruc.md
    ├── seqres.md
    ├── special.md
    ├── structure-data-model.md
    └── symmetry.md
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | .profile
3 | .settings
4 | .idea
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Tutorial
2 | ===
3 |
4 | A brief introduction to [BioJava](https://www.biojava.org).
5 | -----
6 |
7 | The goal of this tutorial is to provide an educational introduction to some of the features provided by BioJava. The tutorial is still under development and does not yet cover the entire library. Please also check other sources of [documentation](https://biojava.org/wiki/Documentation).
8 |
9 | The examples within the tutorial are intended to work with the most recent version of BioJava. Please do submit a [new issue](https://github.com/biojava/biojava-tutorial/issues) if you find any problems.
10 |
11 | The tutorial is subdivided into several books, corresponding to the respective BioJava modules. Each book is further divided into chapters that describe the main functionality of the module in order of increasing complexity.
12 |
13 | ## Index
14 |
15 | [Quick Installation](installation.md)
16 |
17 | Book 1: [The Core Module](core/README.md), the basics of working with sequences.
18 |
19 | Book 2: [The Alignment Module](alignment/README.md), pairwise and multiple alignments of protein sequences.
20 |
21 | Book 3: [The Structure Modules](structure/README.md), everything related to working with 3D structures.
22 |
23 | Book 4: [The Genomics Module](genomics/README.md), working with genomic data.
24 |
25 | Book 5: [The Protein-Disorder Module](protein-disorder/README.md), predicting protein-disorder.
26 |
27 | Book 6: [The ModFinder Module](modfinder/README.md), identifying protein modifications in 3D structures.
28 |
29 | ## License
30 |
31 | The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license, available [here](license.md).
32 |
33 | ## Please Cite
34 |
35 | **BioJava 5: A community driven open-source bioinformatics library**
36 | *Aleix Lafita, Spencer Bliven, Andreas Prlić, Dmytro Guzenko, Peter W. Rose, Anthony Bradley, Paolo Pavan, Douglas Myers-Turnbull, Yana Valasatava, Michael Heuer, Matt Larson, Stephen K. Burley, & Jose M. Duarte*
37 | [PLOS Computational Biology (2019) 15 (2):e1006791.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006791)
38 | [DOI: 10.1371/journal.pcbi.1006791](https://doi.org/10.1371/journal.pcbi.1006791) | [PubMed: 30735498](http://www.ncbi.nlm.nih.gov/pubmed/30735498)
39 |
40 |
41 |
42 |
--------------------------------------------------------------------------------
/alignment/README.md:
--------------------------------------------------------------------------------
1 | The BioJava - Alignment Module
2 | =====================================================
3 |
4 | A tutorial for the alignment module of [BioJava](http://www.biojava.org).
5 |
6 | ## About
7 |
13 | The alignment module of BioJava provides an API that contains
14 |
|
20 |
32 | mvn package
33 |
34 |
35 | on your project, the BioJava dependencies will be automatically downloaded and installed for you.
36 |
37 |
38 |
39 |
40 | ---
41 |
42 | Navigation:
43 | [Home](../README.md)
44 | | [Book 2: The Alignment Module](README.md)
45 | | Chapter 1 : Installation
46 |
--------------------------------------------------------------------------------
/alignment/smithwaterman.md:
--------------------------------------------------------------------------------
1 | Smith Waterman - Local Alignment
2 | ================================
3 |
4 | BioJava contains implementations of various protein sequence and 3D structure alignment algorithms. Here is how to run a local (Smith-Waterman) alignment of two protein sequences:
5 |
6 |
7 |
8 | ```java
9 | public static void main(String[] args) throws Exception {
10 |
11 |     String uniprotID1 = "P69905";
12 |     String uniprotID2 = "P68871";
13 |
14 |     ProteinSequence s1 = getSequenceForId(uniprotID1);
15 |     ProteinSequence s2 = getSequenceForId(uniprotID2);
16 |
17 |     SubstitutionMatrix
13 | The core module of BioJava provides an API that provides
14 |
|
20 |
33 | mvn package
34 |
35 |
36 | on your project, the BioJava dependencies will be automatically downloaded and installed for you.
37 |
38 |
39 |
40 |
41 | ---
42 |
43 | Navigation:
44 | [Home](../README.md)
45 | | [Book 1: The Core Module](README.md)
46 | | Chapter 1 : Installation
47 |
48 | Next: [Chapter 2 : Basic Sequence types](sequences.md)
49 |
--------------------------------------------------------------------------------
/core/readwrite.md:
--------------------------------------------------------------------------------
1 | Reading and Writing of Basic sequence file formats
2 | ==================================================
3 |
4 |
5 | TODO: needs more examples
6 |
7 |
8 | ## FASTA
9 |
10 | A quick way of parsing a FASTA file is using the FastaReaderHelper class.
11 |
12 | Here is an example that parses a UniProt FASTA file into a protein sequence.
13 |
14 | ```java
15 | public static ProteinSequence getSequenceForId(String uniProtId) throws Exception {
16 |     URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId));
17 |     ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId);
18 |     System.out.printf("id : %s %s%s%s", uniProtId, seq, System.getProperty("line.separator"), seq.getOriginalHeader());
19 |     System.out.println();
20 |
21 |     return seq;
22 | }
23 | ```
24 |
25 |
26 | BioJava can also be used to parse large FASTA files. The example below can parse a 1GB (compressed) version of TREMBL with standard memory settings.
27 |
28 |
29 | ```java
30 |
31 |
32 |
33 | /** Download a large file, e.g.
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz
34 |  * and pass in path to local location of file
35 |  *
36 |  * @param args
37 |  */
38 | public static void main(String[] args) {
39 |
40 |     if (args.length < 1) {
41 |         System.err.println("First argument needs to be path to fasta file");
42 |         return;
43 |     }
44 |
45 |     File f = new File(args[0]);
46 |
47 |     if (!f.exists()) {
48 |         System.err.println("File does not exist " + args[0]);
49 |         return;
50 |     }
51 |
52 |     try {
53 |
54 |         // automatically uncompresses files using InputStreamProvider
55 |         InputStreamProvider isp = new InputStreamProvider();
56 |
57 |         InputStream inStream = isp.getInputStream(f);
58 |
59 |         FastaReader
13 | The genome module of BioJava provides an API that allows to
14 |
|
20 |
48 | mvn package
49 |
50 |
51 | on your project, the BioJava dependencies will be automatically downloaded and installed for you.
52 |
53 |
54 |
55 |
56 | ---
57 |
58 | Navigation:
59 | [Home](../README.md)
60 | | [Book 4: The Genomics Module](README.md)
61 | | Chapter 1 : Installation
62 |
63 | Next: [Chapter 2 : gene names information](genenames.md)
64 |
--------------------------------------------------------------------------------
/genomics/karyotype.md:
--------------------------------------------------------------------------------
1 | Parsing a karyotype file from the UCSC genome browser
2 | =====================================================
3 |
4 | Karyotype information for the human genome can be read from UCSC's [cytoBand.txt.gz](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz)
5 | file.
6 |
7 | ```java
8 |
9 | CytobandParser me = new CytobandParser();
10 | try {
11 |     SortedSet
41 | mvn package
42 |
43 |
44 | on your project, the BioJava dependencies will be automatically downloaded and installed for you.
45 |
46 |
--------------------------------------------------------------------------------
/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biojava/biojava-tutorial/5b44b5651f00584df1c2a90c543c7b329cb7315b/logo.png
--------------------------------------------------------------------------------
/modfinder/README.md:
--------------------------------------------------------------------------------
1 | The ModFinder Module of BioJava
2 | =====================================================
3 |
4 | A tutorial for the modfinder module of [BioJava](http://www.biojava.org)
5 |
6 | ## About
7 |
13 | The modfinder module of BioJava provides an API for identification of protein pre-, co-, and post-translational modifications from structures.
35 | mvn package
36 |
37 |
38 | on your project, the BioJava dependencies will be automatically downloaded and installed for you.
39 |
40 |
41 |
42 |
43 | ---
44 |
45 | Navigation:
46 | [Home](../README.md)
47 | | [Book 6: The ModFinder Modules](README.md)
48 | | Chapter 1 : Installation
49 |
50 | Next: [Chapter 2 : How to get the list of supported protein modifications](supported-protein-modifications.md)
51 |
--------------------------------------------------------------------------------
/modfinder/supported-protein-modifications.md:
--------------------------------------------------------------------------------
1 | How to get a list of supported protein modifications?
2 | ===
3 |
4 | The protmod module contains [an XML file](https://github.com/biojava/biojava/blob/master/biojava-modfinder/src/main/resources/org/biojava/nbio/protmod/ptm_list.xml) that defines a list of protein modifications retrieved from the [Protein Data Bank Chemical Component Dictionary](http://www.wwpdb.org/ccd.html), [RESID](http://pir.georgetown.edu/resid/), and [PSI-MOD](http://www.psidev.info/MOD). It contains many common modifications such as glycosylation, phosphorylation, acetylation, methylation, etc. Crosslinks are also included, such as disulfide bonds and iso-peptide bonds.
5 |
6 | The protmod module maintains a registry of supported protein modifications. The list of protein modifications contained in the XML file is loaded automatically. You can [define and register a new protein modification](add-protein-modification.md) if it has not been defined in the XML file.
From the protein modification registry, a user can retrieve:
7 | - all protein modifications,
8 | - a protein modification by ID,
9 | - a set of protein modifications by RESID ID,
10 | - a set of protein modifications by PSI-MOD ID,
11 | - a set of protein modifications by PDBCC ID,
12 | - a set of protein modifications by category (attachment, modified residue, crosslink1, crosslink2, …, crosslink7),
13 | - a set of protein modifications by occurrence type (natural or hypothetical),
14 | - a set of protein modifications by a keyword (glycoprotein, phosphoprotein, sulfoprotein, …),
15 | - a set of protein modifications by involved components.
16 |
17 | ## Examples
18 |
19 | ```java
20 | // a protein modification by ID
21 | ProteinModification mod = ProteinModificationRegistry.getById("0001");
22 |
23 | Set<ProteinModification> mods;
24 |
25 | // all protein modifications
26 | mods = ProteinModificationRegistry.allModifications();
27 |
28 | // a set of protein modifications by RESID ID
29 | mods = ProteinModificationRegistry.getByResidId("AA0151");
30 |
31 | // a set of protein modifications by PSI-MOD ID
32 | mods = ProteinModificationRegistry.getByPsimodId("MOD:00305");
33 |
34 | // a set of protein modifications by PDBCC ID
35 | mods = ProteinModificationRegistry.getByPdbccId("SEP");
36 |
37 | // a set of protein modifications by category
38 | mods = ProteinModificationRegistry.getByCategory(ModificationCategory.ATTACHMENT);
39 |
40 | // a set of protein modifications by occurrence type
41 | mods = ProteinModificationRegistry.getByOccurrenceType(ModificationOccurrenceType.NATURAL);
42 |
43 | // a set of protein modifications by a keyword
44 | mods = ProteinModificationRegistry.getByKeyword("phosphoprotein");
45 |
46 | // a set of protein modifications by involved components.
47 | mods = ProteinModificationRegistry.getByComponent(Component.of("FAD"));
48 |
49 | ```
50 |
51 | Navigation:
52 | [Home](../README.md)
53 | | [Book 6: The ModFinder Modules](README.md)
54 | | Chapter 2 - How to get a list of supported protein modifications
55 |
56 | Prev: [Chapter 1 : Installation](installation.md)
57 |
58 | Next: [Chapter 3 : How to identify protein modifications in a structure](identify-protein-modifications.md)
59 |
--------------------------------------------------------------------------------
/protein-disorder/README.md:
--------------------------------------------------------------------------------
1 | The Protein-Disorder Module of BioJava
2 | =====================================================
3 |
4 | A tutorial for the protein-disorder module of [BioJava](http://www.biojava.org)
5 |
6 | ## About
7 |
13 | The protein-disorder module of BioJava provides an API that allows to
14 |
|
20 |
13 | The protein structure modules of BioJava provide an API that allows to
14 |
|
22 |
94 | MultipleAlignmentEnsemble
95 |    |
96 | MultipleAlignment(s)
97 |    |
98 | BlockSet(s)
99 |    |
100 | Block(s)
101 |
102 |
103 | * **MultipleAlignmentEnsemble**: the ensemble is the top level of the hierarchy.
104 | As a top level, it stores information regarding creation properties (algorithm,
105 | version, creation time, etc.), the structures involved in the alignment (Atoms,
106 | structure identifiers, etc.) and cached variables (atomic distance matrices).
107 | It contains a collection of `MultipleAlignment` that share the same properties
108 | stored in the ensemble. This construction allows the storage of alternative
109 | alignments inside the same data structure.
110 |
111 | * **MultipleAlignment**: the `MultipleAlignment` stores the core information of a
112 | multiple structure alignment. It is designed to be the return type of the multiple
113 | structure alignment algorithms. The object contains a collection of `BlockSet` and
114 | it is linked to its parent `MultipleAlignmentEnsemble`.
115 |
116 | * **BlockSet**: the `BlockSet` stores a flexible part of a multiple structure
117 | alignment. A flexible part needs the residue equivalencies involved, contained in
118 | a collection of `Block`, and a transformation matrix for every structure that
119 | describes the 3D superposition of all structures. It is linked to its parent
120 | `MultipleAlignment`.
121 |
122 | * **Block**: the `Block` stores the aligned positions (equivalent residues) of a
123 | `BlockSet` that are in sequentially increasing order. Each `Block` represents a
124 | sequential part of a non-topological alignment, if more than one `Block` is present.
125 | It is linked to its parent `BlockSet`.
126 |
127 | ### The Optimal Alignment
128 |
129 | In the `MultipleAlignment` data structure the aligned residues are stored in a
130 | double List for every `Block`. The indices of the double List are the following:
131 |
132 | ```java
133 | List
59 | Figures: the asymmetric unit of hemoglobin (PDB ID 1HHO) and the biological assembly of hemoglobin (PDB ID 1HHO).
82 | Figures: the asymmetric unit of the bacteriophage GA protein capsid (PDB ID 1GAV) and the biological assembly of the bacteriophage GA protein capsid (PDB ID 1GAV).
114 | -Xmx10G
115 |
116 |
117 | Note: when loading this structure with 9GB of memory, the Java VM spends a significant amount of time in garbage collection (GC). If you provide more RAM than the minimum requirement, then GC is triggered less often and the biological assembly loads faster.
118 |
119 |
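
Because an undersized heap typically shows up only as long GC pauses or an `OutOfMemoryError`, it can help to check the heap limit before loading a very large assembly. The following is a minimal sketch that uses only the standard `Runtime` API. The class name `HeapCheck`, the helper `hasEnoughHeap`, and the 9 GB threshold (taken from the note above) are illustrative assumptions and are not part of BioJava.

```java
public class HeapCheck {

    /** Bytes in one gigabyte (2^30). */
    static final long GB = 1024L * 1024L * 1024L;

    /** Returns true if the given heap limit (in bytes) reaches the required number of gigabytes. */
    static boolean hasEnoughHeap(long maxHeapBytes, double requiredGb) {
        return maxHeapBytes >= (long) (requiredGb * GB);
    }

    public static void main(String[] args) {
        // Upper bound the JVM will use for the heap, as set by -Xmx (or the JVM default)
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %.1f GB%n", maxHeap / (double) GB);

        // Hypothetical threshold: the 9 GB mentioned in the note above
        if (!hasEnoughHeap(maxHeap, 9.0)) {
            System.err.println("Less than 9 GB of heap available; consider starting the JVM with -Xmx10G");
        }
    }
}
```

Running it with `java -Xmx10G HeapCheck` should report a limit close to 10 GB; the exact value reported by `maxMemory()` can be slightly lower than the `-Xmx` setting, depending on the JVM.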
127 | Figure: the biological assembly of the PBCV-1 virus capsid (image source: RCSB).
31 | -DPDB_DIR=/wherever/you/want/
32 |
33 |
34 | BioJava will also check for a `PDB_DIR` environment variable. If you launch BioJava from the command line, it can be useful to include `export PDB_DIR=/wherever/you/want` in your `.bashrc` file.
35 |
36 | An alternative is to hard-code the path in this way (but setting it as a property is better style):
37 |
38 | ```java
39 | AtomCache cache = new AtomCache();
40 |
41 | cache.setPath("/path/to/pdb/files/");
42 | ```
43 |
44 | ## File Parsing Parameters
45 |
46 | The AtomCache also gives access to various options that are available during the
47 | parsing of files. The [FileParsingParameters](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/FileParsingParameters.html)
48 | class is the main place to influence the level of detail and, as a consequence, the speed with which files can be loaded.
49 |
50 | This example obtains the `FileParsingParameters` from the cache; set the desired options on `params` before loading a `Structure`. (See also the [next chapter](chemcomp.md))
51 |
52 | ```java
53 | AtomCache cache = new AtomCache();
54 |
55 | cache.setPath("/tmp/");
56 |
57 | FileParsingParameters params = cache.getFileParsingParams();
58 |
59 | StructureIO.setAtomCache(cache);
60 |
61 | Structure structure = StructureIO.getStructure("4hhb");
62 |
63 | ```
64 |
65 | ## Caching of SCOP, CATH and Other Domain Definitions
66 |
67 | The AtomCache not only provides access to PDB, it can also fetch Structure representations of protein domains, as defined by SCOP and CATH, and the algorithms Protein Domain Parser (PDP) and Domain Parser (DP).
68 |
69 | ```java
70 | // uses a SCOP domain definition
71 | Structure domain1 = StructureIO.getStructure("d4hhba_");
72 |
73 | // Get a specific protein chain, note: chain IDs are case sensitive, PDB IDs are not.
74 | Structure chain1 = StructureIO.getStructure("4HHB.A");
75 |
76 | ```
77 |
78 | There are quite a number of external database IDs that are supported here.
See the
79 | AtomCache documentation for more details on the supported options.
80 |
81 | The non-PDB files can be cached at a different location by setting the `PDB_CACHE_DIR` property (with `java -DPDB_CACHE_DIR=...`) or environment variable.
82 |
83 |
84 |
85 | ---
86 |
87 | Navigation:
88 | [Home](../README.md)
89 | | [Book 3: The Structure Modules](README.md)
90 | | Chapter 4 : Local Installations
91 |
92 | Prev: [Chapter 3 : Structure Data Model](structure-data-model.md)
93 |
94 | Next: [Chapter 5 : Chemical Component Dictionary](chemcomp.md)
95 |
--------------------------------------------------------------------------------
/structure/chemcomp.md:
--------------------------------------------------------------------------------
1 | The Chemical Component Dictionary
2 | =================================
3 |
4 | The [Chemical Component Dictionary](http://www.wwpdb.org/ccd.html) is an external reference file describing all residue and small molecule components found in PDB entries. This dictionary contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands, and solvent molecules.
5 |
6 | ### How Does BioJava Decide what Groups Are Amino Acids?
7 |
8 | BioJava utilizes the Chem. Comp. Dictionary to achieve a chemically correct representation of each group. To make it clear how this works, let's take a look at how [Selenomethionine](http://en.wikipedia.org/wiki/Selenomethionine) and water are handled:
9 |
10 | ```java
11 | Structure structure = StructureIO.getStructure("1A62");
12 |
13 | for (Chain chain : structure.getChains()){
14 |     for (Group group : chain.getAtomGroups()){
15 |         if ( group.getPDBName().equals("MSE") || group.getPDBName().equals("HOH")){
16 |             System.out.println(group.getPDBName() + " is a group of type " + group.getType());
17 |         }
18 |     }
19 | }
20 | ```
21 |
22 | This will give this output:
23 |
24 |
25 | MSE is a group of type amino
26 | MSE is a group of type amino
27 | MSE is a group of type amino
28 | HOH is a group of type hetatm
29 | HOH is a group of type hetatm
30 | HOH is a group of type hetatm
31 | ...
32 |
33 |
34 | As you can see, although MSE is flagged as HETATM in the PDB file, BioJava still represents it correctly as an amino acid. The key is that the [definition file for MSE](http://www.rcsb.org/pdb/files/ligand/MSE.cif) flags it as "L-PEPTIDE LINKING", which BioJava uses to assign the group type.
35 |
36 | Note: Selenomethionine is a naturally occurring amino acid containing selenium. It has the ID MSE in the Chemical Component Dictionary.
37 |
38 |
39 | ### How to Access Chemical Component Definitions
40 |
41 | By default BioJava will retrieve the full chemical component definitions provided by the PDB. That way BioJava makes sure that the user gets a correct representation, e.g. it distinguishes ligands from the polypeptide chain, correctly resolves chemically modified residues, etc.
42 |
43 | The behaviour is configurable by setting a property in the `ChemCompGroupFactory` singleton:
44 |
45 | 1. Use a minimal built-in set of **Chemical Component Definitions**. This only deals with the most frequent cases of chemical components. It does not guarantee a correct representation, but it is fast and does not require network access.
46 |     ```java
47 |     ChemCompGroupFactory.setChemCompProvider(new ReducedChemCompProvider());
48 |     ```
49 | 2. Load all **Chemical Component Definitions** at startup (slow startup, but then no further delays later on; requires more memory).
50 |     ```java
51 |     ChemCompGroupFactory.setChemCompProvider(new AllChemCompProvider());
52 |     ```
53 | 3. Fetch missing **Chemical Component Definitions** on the fly (small download and parsing delays every time a new chemical compound is found). Default behaviour since 4.2.0. Note that the chemical component files are cached in the local file system for subsequent uses.
54 |     ```java
55 |     ChemCompGroupFactory.setChemCompProvider(new DownloadChemCompProvider());
56 |     ```
57 |
58 |
59 |
60 |
61 | ---
62 |
63 | Navigation:
64 | [Home](../README.md)
65 | | [Book 3: The Structure Modules](README.md)
66 | | Chapter 5 : Chemical Component Dictionary
67 |
68 | Prev: [Chapter 4 : Local Installations](caching.md)
69 |
70 | Next: [Chapter 6 : Work with mmCIF/PDBx Files](mmcif.md)
71 |
--------------------------------------------------------------------------------
/structure/contact-map.md:
--------------------------------------------------------------------------------
1 | # Finding contacts between atoms in a protein: contact maps
2 |
3 | Contacts are a useful tool to analyse protein structures. They simplify the 3-Dimensional view of the structures into a 2-Dimensional set of contacts between its atoms or its residues. The representation of the contacts in a matrix is known as the contact map. Many protein structure analysis and prediction efforts are done by using contacts.
For instance they can be useful for:
4 |
5 | + development of structural alignment algorithms [Holm 1993][] [Caprara 2004][]
6 | + automatic domain identification [Alexandrov 2003][] [Emmert-Streib 2007][]
7 | + structural modelling by extraction of contact-based empirical potentials [Benkert 2008][]
8 | + structure prediction via contact prediction from sequence information [Jones 2012][]
9 |
10 | ## Getting the contact map of a protein chain
11 |
12 | This code snippet will produce the set of contacts between all C alpha atoms for chain A of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT):
13 |
14 | ```java
15 | AtomCache cache = new AtomCache();
16 | StructureIO.setAtomCache(cache);
17 |
18 | Structure structure = StructureIO.getStructure("1SMT");
19 |
20 | Chain chain = structure.getChainByPDB("A");
21 |
22 | // we want contacts between Calpha atoms only (atom names are unpadded in recent BioJava versions)
23 | String[] atoms = {"CA"};
24 | // the distance cutoff we use is 8 Å
25 | AtomContactSet contacts = StructureTools.getAtomsInContact(chain, atoms, 8.0);
26 |
27 | System.out.println("Total number of CA-CA contacts: "+contacts.size());
28 |
29 |
30 | ```
31 |
32 | The algorithm to find the contacts uses spatial hashing, so it does not need to calculate a full distance matrix and scales well with structure size.
33 |
34 | ## Getting the contacts between two protein chains
35 |
36 | One can also find the contacting atoms between two protein chains.
For instance the following code finds the contacts between the first 2 chains of PDB entry [1SMT](http://www.rcsb.org/pdb/explore.do?structureId=1SMT):
37 |
38 | ```java
39 | AtomCache cache = new AtomCache();
40 | StructureIO.setAtomCache(cache);
41 |
42 | Structure structure = StructureIO.getStructure("1SMT");
43 |
44 | AtomContactSet contacts =
45 |     StructureTools.getAtomsInContact(structure.getChain(0), structure.getChain(1), 5, false);
46 |
47 | System.out.println("Total number of atom contacts: "+contacts.size());
48 |
49 | // the list of atom contacts can be reduced to a list of contacts between groups:
50 | GroupContactSet groupContacts = new GroupContactSet(contacts);
51 | ```
52 |
53 |
54 | See [DemoContacts](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoContacts.java) for a fully working demo of the examples above.
55 |
56 |
57 |
58 | [Holm 1993]: http://www.biomedcentral.com/pubmed/8377180
59 | [Caprara 2004]: http://www.biomedcentral.com/pubmed/15072687
60 | [Alexandrov 2003]: http://www.biomedcentral.com/pubmed/12584135
61 | [Emmert-Streib 2007]: http://www.biomedcentral.com/pubmed/17608939
62 | [Benkert 2008]: http://www.biomedcentral.com/pubmed/17932912
63 | [Jones 2012]: http://www.ncbi.nlm.nih.gov/pubmed/22101153
64 |
65 |
66 |
67 | ---
68 |
69 | Navigation:
70 | [Home](../README.md)
71 | | [Book 3: The Structure Modules](README.md)
72 | | Chapter 12 : Contacts Within a Chain and between Chains
73 |
74 | Prev: [Chapter 11 : Accessible Surface Areas](asa.md)
75 |
76 | Next: [Chapter 13 - Finding all Interfaces in Crystal: Crystal Contacts](crystal-contacts.md)
77 |
--------------------------------------------------------------------------------
/structure/crystal-contacts.md:
--------------------------------------------------------------------------------
1 | # How to find all crystal contacts in a PDB structure
2 |
3 | ## Why crystal contacts?
4 |
5 | A protein structure is determined by X-ray diffraction from a protein crystal, i.e. an infinite lattice of molecules. Thus the end result of the diffraction experiment is a crystal lattice and not just a single molecule. However, the PDB file only contains the coordinates of the Asymmetric Unit (AU), defined as the minimum unit needed to reconstruct the full crystal using symmetry operators.
6 |
7 | Looking at the AU alone is not enough to understand the crystal structure. For instance, the biologically relevant assembly (known as the Biological Unit) can be formed through a symmetry operator, which can be found by looking at the crystal contacts. See for instance [1M4N](http://www.rcsb.org/pdb/explore.do?structureId=1M4N): its biological unit is a dimer that is formed through a 2-fold operator and is the largest interface found in the crystal.
8 |
9 | Looking at crystal contacts can also be important in order to assess the quality and reliability of the deposited PDB model: an AU can look perfectly fine, but upon reconstruction of the lattice the molecules can clash, which indicates that something is wrong in the model.
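
To make the role of a symmetry operator concrete: each operator is a rotation matrix plus a translation, and applying it to every atom of the AU generates a symmetry-related copy. The sketch below is plain Java with no BioJava calls; the class `SymmetryOperatorSketch`, the method `apply`, and the coordinates are made up for illustration. It applies a generic 2-fold rotation about the z axis, the kind of operator that can relate the two halves of a dimer.

```java
import java.util.Arrays;

public class SymmetryOperatorSketch {

    /** Applies x' = R * x + t to a single coordinate triple. */
    static double[] apply(double[][] rotation, double[] translation, double[] coord) {
        double[] out = new double[3];
        for (int i = 0; i < 3; i++) {
            out[i] = translation[i];
            for (int j = 0; j < 3; j++) {
                out[i] += rotation[i][j] * coord[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 2-fold rotation about the z axis: (x, y, z) -> (-x, -y, z)
        double[][] twoFold = {
            {-1, 0, 0},
            {0, -1, 0},
            {0, 0, 1}
        };
        double[] noTranslation = {0, 0, 0};

        // made-up coordinates (in Angstrom) of one atom in the asymmetric unit
        double[] atom = {1.5, -2.0, 3.0};

        double[] mate = apply(twoFold, noTranslation, atom);
        System.out.println("AU atom:       " + Arrays.toString(atom));
        System.out.println("symmetry mate: " + Arrays.toString(mate));

        // applying a 2-fold operator twice brings the atom back to its original position
        double[] back = apply(twoFold, noTranslation, mate);
        System.out.println("applied twice: " + Arrays.toString(back));
    }
}
```

In a real crystal the operators come from the space group and usually include fractional translations expressed in the unit cell basis, which this sketch leaves out.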
10 |
11 |
12 | ## Getting the set of unique contacts in the crystal lattice
13 |
14 | This code snippet will produce a list of all non-redundant interfaces present in the crystal lattice of PDB entry [1SMT](https://www.rcsb.org/structure/1SMT):
15 |
16 | ```java
17 | AtomCache cache = new AtomCache();
18 |
19 | StructureIO.setAtomCache(cache);
20 |
21 | Structure structure = StructureIO.getStructure("1SMT");
22 |
23 | CrystalBuilder cb = new CrystalBuilder(structure);
24 |
25 | // 6 is the distance cutoff to consider 2 atoms in contact
26 | StructureInterfaceList interfaces = cb.getUniqueInterfaces(6);
27 |
28 | System.out.println("The crystal contains "+interfaces.size()+" unique interfaces");
29 |
30 | // this calculates the buried surface areas of all interfaces and sorts them by areas
31 | interfaces.calcAsas(3000, 1, -1);
32 |
33 | // we can get the largest interface in the crystal and look at its area
34 | interfaces.get(1).getTotalArea();
35 |
36 | ```
37 |
38 | An interface is defined here as any 2 chains with at least one pair of atoms within the given distance cutoff (6 Å in the example above).
39 |
40 | The algorithm to find all unique interfaces in the crystal works roughly like this:
41 | + It reconstructs the full unit cell by applying the matrix operators of the corresponding space group to the Asymmetric Unit.
42 | + It searches all cells around the original one by applying crystal translations; if any 2 chains in that search are found to be in contact, the new contact is added to the final list.
43 | + The search is performed without repeating redundant symmetry operators, making sure that if a contact is found then it is a unique contact.
44 |
45 | See [DemoCrystalInterfaces](https://github.com/biojava/biojava/blob/master/biojava-structure/src/main/java/demo/DemoCrystalInterfaces.java) for a fully working demo of the example above.
46 |
47 | ## Clustering the interfaces
48 | One can also cluster the interfaces based on their similarity.
The similarity is measured through contact overlap: the number of common contacts divided by the average number of contacts in both chains. The clustering can be done as follows:
49 |
50 | ```java
51 | List
14 | Figure: (Top) The structure 1dan contains four chains. (Bottom) These chains are broken up into six SCOP domains. The green chain L becomes 3 domains, while a combination of chains U (red) and T (orange) forms the central purple domain.

This will show a view of the structure and the following text:

got 4 domains

* domain 1danH01 has # segments: 2 color: red
  * CathSegment [segmentId=1, start=16, stop=27, length=12, sequenceHeader=null, sequence=null]
  * CathSegment [segmentId=2, start=121, stop=232, length=112, sequenceHeader=null, sequence=null]
* domain 1danH02 has # segments: 2 color: green
  * CathSegment [segmentId=1, start=28, stop=120, length=93, sequenceHeader=null, sequence=null]
  * CathSegment [segmentId=2, start=233, stop=246, length=14, sequenceHeader=null, sequence=null]
* domain 1danU00 has # segments: 1 color: blue
  * CathSegment [segmentId=1, start=91, stop=210, length=120, sequenceHeader=null, sequence=null]
* domain 1danT00 has # segments: 1 color: yellow
  * CathSegment [segmentId=1, start=6, stop=80, length=75, sequenceHeader=null, sequence=null]

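
The segments in the listing above can be read as inclusive residue ranges, and a domain as a set of such ranges. The following is a minimal sketch of that idea, not BioJava API: the class names `DomainRanges` and `ResidueRange` are hypothetical, and insertion codes are ignored for simplicity. It uses the two segments printed for domain 1danH01.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of BioJava.
final class DomainRanges {

    // One segment, with both ends inclusive (as in the CathSegment output).
    static final class ResidueRange {
        final int start;
        final int stop;

        ResidueRange(int start, int stop) {
            this.start = start;
            this.stop = stop;
        }

        // Inclusive range, so 16..27 covers 12 residues.
        int length() {
            return stop - start + 1;
        }

        boolean contains(int residueNumber) {
            return residueNumber >= start && residueNumber <= stop;
        }
    }

    // A domain may be built from several discontinuous segments.
    static boolean contains(List<ResidueRange> domain, int residueNumber) {
        for (ResidueRange range : domain) {
            if (range.contains(residueNumber)) {
                return true;
            }
        }
        return false;
    }

    // Sum of the lengths of all segments of a domain.
    static int totalLength(List<ResidueRange> domain) {
        int total = 0;
        for (ResidueRange range : domain) {
            total += range.length();
        }
        return total;
    }

    public static void main(String[] args) {
        // Segments of domain 1danH01, copied from the listing above.
        List<ResidueRange> h01 = Arrays.asList(
                new ResidueRange(16, 27),
                new ResidueRange(121, 232));

        System.out.println("total length: " + totalLength(h01));
        System.out.println("contains 130: " + contains(h01, 130));
        System.out.println("contains 50:  " + contains(h01, 50));
    }
}
```

The `length` values in the listing follow the same inclusive convention (`stop - start + 1`), which is why segment 16 to 27 is reported with length 12.
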

The crystal structure of human deoxyhaemoglobin PDB ID 4HHB (image source: RCSB)

<pre>
    -DPDB_DIR=/wherever/you/want/
</pre>

## Memory Consumption

Talking about startup properties, it is also worth mentioning that many PDB entries are large molecules, and the default memory allowance for Java applications is not sufficient in many cases. BioJava contains several built-in caches which automatically adjust to the available memory. As such, the more memory you grant your Java application, the better it can utilize the caches and the better the performance will be. Change the maximum heap space of your Java VM with this startup parameter:

<pre>
    -Xmx1G
</pre>

## A Quick 3D View

If you have the *biojava-structure-gui* module installed, you can quickly visualise a [Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html) via this:

```java
public static void main(String[] args) throws Exception {
    Structure struc = StructureIO.getStructure("4hhb");

    StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol();

    jmolPanel.setStructure(struc);

    // send some commands to Jmol
    jmolPanel.evalString("select * ; color chain;");
    jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; ");
    jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;");
}
```

This will result in the following view:


The StructureAlignmentJmol class provides a wrapper for the Jmol viewer and provides a bridge to BioJava, so Structure objects can be sent to Jmol for visualisation.

34 | mvn package 35 |36 | 37 | on your project, the BioJava dependencies will be automatically downloaded and installed for you. 38 | 39 | ### (Optional) Configuration 40 | 41 | BioJava can be configured through several properties: 42 | 43 | | Property | Description | 44 | | --- | --- | 45 | | `PDB_DIR` | Directory for caching structure files from the PDB. Mirrors the PDB's FTP server directory structure, with `PDB_DIR` equivalent to ftp://ftp.wwpdb.org/pub/pdb/. Default: temp directory | 46 | | `PDB_CACHE_DIR` | Cache directory for other files related to the structure package. Default: temp directory | 47 | 48 | These can be set either as java properties or as environmental variables. For example: 49 | 50 | ``` 51 | # This could be added to .bashrc 52 | export PDB_DIR=... 53 | # Or override for a particular execution 54 | java -DPDB_DIR=... -cp ... 55 | ``` 56 | 57 | Note that your IDE may ignore `.bashrc` settings, but should have a preference for passing VM arguments. 58 | 59 | 60 | 61 | --- 62 | 63 | Navigation: 64 | [Home](../README.md) 65 | | [Book 3: The Structure Modules](README.md) 66 | | Chapter 1 : Installation 67 | 68 | Next: [Chapter 2 : First Steps](firststeps.md) 69 | -------------------------------------------------------------------------------- /structure/lists.md: -------------------------------------------------------------------------------- 1 | # Lists of PDB IDs and PDB Status Information 2 | 3 | ## Get a list of all current PDB IDs 4 | 5 | The following code connects to one of the PDB servers and fetches a list of all current PDB IDs. 6 | 7 | ```java 8 | SortedSet
<pre>
    -DPDB_DIR=/wherever/you/want/
</pre>

## Switching AtomCache to use different file types

By default BioJava uses the BCIF file format for parsing data. In order to switch it to use mmCIF, we can take control over
the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/util/AtomCache.html) which
manages your PDB (and also [SCOP, CATH](externaldb.md)) installations.

```java
AtomCache cache = new AtomCache();

cache.setFiletype(StructureFiletype.CIF);

// if you struggled to set the PDB_DIR property correctly in the previous step,
// you could set it manually like this:
cache.setPath("/tmp/");

StructureIO.setAtomCache(cache);

Structure structure = StructureIO.getStructure("4HHB");

// and let's count how many chains are in this structure.
System.out.println(structure.getChains().size());
```

See other supported file types in the `StructureFiletype` enum.

## URL based parsing of files

StructureIO can also access files via URLs and fetch the data dynamically. E.g. the following code shows how to load a file from a remote server.

```java
String u = "http://ftp.wwpdb.org/pub/pdb/data/biounit/mmCIF/divided/nw/4nwr-assembly1.cif.gz";
Structure s = StructureIO.getStructure(u);
System.out.println(s);
```

### Local URLs
BioJava can also access local files, by specifying the URL as

<pre>
    file:///path/to/local/file
</pre>


## Low Level Access

You can load a BioJava `Structure` object using the ciftools-java parser with:

```java
InputStream inStream = new FileInputStream(fileName);
// now get the protein structure.
Structure cifStructure = CifStructureConverter.fromInputStream(inStream);
```

## I Loaded a Structure Object, What Now?

BioJava provides a number of algorithms and visualisation tools that you can use to further analyse the structure, or look at it. Here are a couple of suggestions for further reading:

+ [The BioJava Cookbook for protein structures](http://biojava.org/wiki/BioJava:CookBook#Protein_Structure)
+ How does BioJava [represent the content](structure-data-model.md) of a PDB/mmCIF file?
+ How to calculate a protein structure alignment using BioJava: [tutorial](alignment.md) or [cookbook](http://biojava.org/wiki/BioJava:CookBook:PDB:align)
+ [How to work with Groups (AminoAcid, Nucleotide, Hetatom)](http://biojava.org/wiki/BioJava:CookBook:PDB:groups)

## Further reading

See the [http://mmcif.rcsb.org/](http://mmcif.rcsb.org/) site for more documentation on mmCIF.

[Westbrook 2000]: http://www.ncbi.nlm.nih.gov/pubmed/10842738 "Westbrook JD and Bourne PE. STAR/mmCIF: an ontology for macromolecular structure. Bioinformatics 2000 Feb; 16(2) 159-68. pmid:10842738."

[Westbrook 2003]: http://www.ncbi.nlm.nih.gov/pubmed/12647386 "Westbrook JD and Fitzgerald PM. The PDB format, mmCIF, and other data formats. Methods Biochem Anal 2003; 44 161-79. pmid:12647386."
115 | 116 | 117 | 118 | 119 | --- 120 | 121 | Navigation: 122 | [Home](../README.md) 123 | | [Book 3: The Structure Modules](README.md) 124 | | Chapter 6 : Work with mmCIF/PDBx Files 125 | 126 | Prev: [Chapter 5 : Chemical Component Dictionary](chemcomp.md) 127 | 128 | Next: [Chapter 7 : SEQRES and ATOM Records](seqres.md) 129 | -------------------------------------------------------------------------------- /structure/secstruc.md: -------------------------------------------------------------------------------- 1 | Protein Secondary Structure 2 | =========================== 3 | 4 | ## What is Protein Secondary Structure? 5 | 6 | Protein secondary structure (SS) is the general three-dimensional form of local segments of proteins. 7 | Secondary structure can be formally defined by the pattern of hydrogen bonds of the protein 8 | (such as alpha helices and beta sheets) that are observed in an atomic-resolution structure. 9 | 10 | More specifically, the secondary structure is defined by the patterns of hydrogen bonds formed between 11 | amine hydrogen (-NH) and carbonyl oxygen (C=O) atoms contained in the backbone peptide bonds of the protein. 12 | 13 | For more info see the Wikipedia article 14 | on [protein secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure). 15 | 16 | ## Secondary Structure Annotation 17 | 18 | ### Information Sources 19 | 20 | There are various ways to obtain the SS annotation of a protein structure: 21 | 22 | - **Authors assignment**: the authors of the structure describe the SS, usually identifying helices 23 | and beta-sheets, and they assign the corresponding type to each residue involved. The authors assignment 24 | can be found in the `PDB` and `mmCIF` file formats deposited in the PDB, and it can be parsed in **BioJava** 25 | when a `Structure` is loaded. 26 | 27 | - **Assignment from Atom coordinates**: there exist various programs to assign the SS of a protein. 
  The algorithms use the atom coordinates of the amino acids to determine hydrogen bonds and geometrical patterns
  that define the different types of protein secondary structure. One of the first and most popular algorithms
  is `DSSP` (Dictionary of Secondary Structure of Proteins). **BioJava** has an implementation of the algorithm,
  written originally in C++, which will be described in the next section.

- **Prediction from sequence**: Other algorithms use only the amino acid sequence (primary structure) of the protein,
  and predict the SS using the SS propensities of each amino acid and multiple alignments with homologous sequences
  (e.g. [PSIPRED](http://bioinf.cs.ucl.ac.uk/psipred/)). At the moment **BioJava** does not have an implementation
  of this type, which would be more suitable for the sequence and alignment modules.

### Secondary Structure Types

Following the `DSSP` convention, **BioJava** defines 8 types of secondary structure:

    E = extended strand, participates in β ladder
    B = residue in isolated β-bridge
    H = α-helix
    G = 3-helix (3-10 helix)
    I = 5-helix (π-helix)
    T = hydrogen bonded turn
    S = bend
    _ = loop (any other type)

## Parsing Secondary Structure in BioJava

Currently there exist two alternatives to parse the secondary structure in **BioJava**: either from the PDB/mmCIF
files of deposited structures (author assignment) or from the output file of a DSSP prediction. Both file types
can be obtained from the PDB servers, if available, so they can be automatically fetched by BioJava.

As an example, you can find here the links of the structure **5PTI** to its
[PDB file](http://www.rcsb.org/pdb/files/5PTI.pdb) (search for the HELIX and SHEET lines) and its
[DSSP file](http://www.rcsb.org/pdb/files/5PTI.dssp).

Note that the DSSP prediction output is more detailed and complete than the authors' assignment.
The choice of one or the other will depend on the use case.

Below you can find some examples of how to parse and assign the SS of a `Structure`:

```java
String pdbID = "5pti";
FileParsingParameters params = new FileParsingParameters();
//Only change needed to the normal Structure loading
params.setParseSecStruc(true); //this is false as DEFAULT

AtomCache cache = new AtomCache();
cache.setFileParsingParams(params);

//The loaded Structure contains the SS assigned
Structure s = cache.getStructure(pdbID);

//If the more detailed DSSP prediction is required call this afterwards
DSSPParser.fetch(pdbID, s, true); //Last parameter true overrides the previous SS
```

For more examples search in the **demo** package for `DemoLoadSecStruc`.

## Assignment of Secondary Structure in BioJava

### Algorithm

The algorithm implemented in BioJava for the assignment of SS is `DSSP`. It is described in the paper from
[Kabsch W. & Sander C. in 1983](http://onlinelibrary.wiley.com/doi/10.1002/bip.360221211/abstract)
([PubMed](http://www.ncbi.nlm.nih.gov/pubmed/6667333)).
A brief explanation of the algorithm and the output format can be found
[here](http://swift.cmbi.ru.nl/gv/dssp/DSSP_3.html).

The interface is very easy: a single method, named *calculate()*, calculates the SS and can assign it to the
input Structure overriding any previous annotation, like in the DSSPParser.
An example can be found below: 96 | 97 | ```java 98 | String pdbID = "5pti"; 99 | AtomCache cache = new AtomCache(); 100 | 101 | //Load structure without any SS assignment 102 | Structure s = cache.getStructure(pdbID); 103 | 104 | //Predict and assign the SS of the Structure 105 | SecStrucCalc ssp = new SecStrucCalc(); //Instantiation needed 106 | ssp.calculate(s, true); //true assigns the SS to the Structure 107 | ``` 108 | 109 | BioJava Class: 110 | [org.biojava.nbio.structure.secstruc.SecStrucCalc](http://www.biojava.org/docs/api/org/biojava/nbio/structure/secstruc/SecStrucCalc.html) 111 | 112 | ### Storage and Data Structures 113 | 114 | Because there are different sources of SS annotation, the data structure in **BioJava** that stores SS assignments 115 | has two levels. The top level `SecStrucInfo` is very general and only contains two properties: **assignment** 116 | (String describing the source of information) and **type** the SS type. 117 | 118 | However, there is an extended container `SecStrucState`, which is a subclass of `SecStrucInfo`, that stores 119 | all the information of the hydrogen bonding, turns, bends, etc. used for the SS prediction and present in the 120 | DSSP output file format. This information is only used in certain applications, and that is the reason for the 121 | more general `SecStrucInfo` class being used by default. 122 | 123 | In order to access the SS information of a `Structure`, the `SecStrucInfo` object needs to be obtained from the 124 | `Group` properties. 
Below you find an example of how to access and print residue by residue the SS information of
a `Structure`:

```java
//This structure should have SS assigned (by any of the methods described)
Structure s;

for (Chain c : s.getChains()) {
    for (Group g: c.getAtomGroups()){
        if (g.hasAminoAtoms()){ //Only AA store SS
            //Obtain the object that stores the SS
            SecStrucInfo ss = (SecStrucInfo) g.getProperty(Group.SEC_STRUC);
            //Print information: chain+resn+name+SS
            System.out.println(c.getChainID()+" "+
                g.getResidueNumber()+" "+
                g.getPDBName()+" -> "+ss);
        }
    }
}
```

### Output Formats

Once the SS has been assigned (either loaded or calculated), there are some easy formats to visualize it in **BioJava**:

- **DSSP format**: the SS can be printed as a DSSP output file format, following the standards so that it can be
parsed again. It is the safest way to serialize a SS annotation and recover it later, but it is probably the most
complicated to visualize.

<pre>
  #  RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA
    1 1 A R 0 0 168 0, 0.0 54,-0.1 0, 0.0 5,-0.1 0.000 360.0 360.0 360.0 139.2 32.2 14.7 -11.8
    2 2 A P > - 0 0 45 0, 0.0 3,-1.8 0, 0.0 4,-0.3 -0.194 360.0-122.0 -61.4 144.9 34.9 13.6 -9.4
    3 3 A D G > S+ 0 0 122 1,-0.3 3,-1.6 2,-0.2 4,-0.2 0.790 108.3 71.4 -62.8 -28.5 35.8 10.0 -9.5
    4 4 A F G > S+ 0 0 26 1,-0.3 3,-1.7 2,-0.2 -1,-0.3 0.725 83.7 70.4 -64.1 -23.3 35.0 9.7 -5.9
</pre>

- **FASTA format**: simple format that prints the SS type of each residue sequentially in the order of the amino acids.
It is the easiest to visualize, but the least informative of all.

<pre>
>5PTI_SS-annotation
GGGGS S EEEEEEETTTTEEEEEEE SSS SS BSSHHHHHHHH
</pre>

- **Helix Summary**: similar to the FASTA format, but also contains information about the helical turns.

172 | 3 turn: >>><<< 173 | 4 turn: >444< >>>>XX<<<< 174 | 5 turn: >5555< 175 | SS: GGGGS S EEEEEEETTTTEEEEEEE SSS SS BSSHHHHHHHH 176 | AA: RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA 177 |178 | 179 | - **Secondary Structure Elements**: another way to visualize the SS annotation is by compacting those sequential residues that share the same SS type and assigning an ID to the range. In this way, a structure can be described by 180 | a collection of helices, strands, turns, etc. and each one of the elements can be identified by an ID (i.e. helix 1 (H1), beta-strand 6 (E6), etc). 181 | 182 |
183 | G1: 3 - 6 184 | S1: 7 - 7 185 | S2: 13 - 13 186 | E1: 18 - 24 187 | T1: 25 - 28 188 | E2: 29 - 35 189 | S3: 37 - 39 190 | S4: 42 - 43 191 | B1: 45 - 45 192 | S5: 46 - 47 193 | H1: 48 - 55 194 |195 | 196 | You can find examples of how to get the different file formats in the class `DemoSecStrucPred` in the **demo** 197 | package. 198 | 199 | ### Example 200 | 201 | Use dependencies from maven 202 | 203 | ```xml 204 |
25 | Seqres groups -> sequence that has been used in the experiment 26 | Atom groups -> subset of Seqres groups for which coordinates could be obtained 27 |28 | 29 | The *mmCIF/PDBx* file format contains the information how the Seqres and atom records are mapped onto each other. However the *PDB format* does not clearly specify how to resolve this mapping. BioJava contains a utility class that maps the Seqres to the Atom records when parsing PDB files. This class performs an alignment using dynamic programming, which can slow down the parsing process. If you do not require the precise Seqres to Atom mapping, you can turn it off like this: 30 | 31 | ```java 32 | AtomCache cache = new AtomCache(); 33 | 34 | FileParsingParameters params = cache.getFileParsingParams(); 35 | 36 | params.setAlignSeqRes(false); 37 | 38 | Structure structure = StructureIO.getStructure(...); 39 | 40 | ``` 41 | 42 | ## Accessing Seqres and Atom Groups 43 | 44 | By default BioJava loads both the Seqres and Atom groups into the [Chain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Chain.html) 45 | objects. 46 | 47 |
<pre>
Chain -> Seqres groups
      -> Atom groups
</pre>

Groups that are part of the Seqres sequence as well as of the Atom records are mapped onto each other. This means you
can iterate over all Seqres groups in a chain and check if they have observed atoms.

## Mapping from Uniprot to Atom Records

The mapping between PDB and UniProt changes over time, due to the dynamic nature of biological data. The [PDBe](http://www.pdbe.org) has a project that provides up-to-date mappings between the two databases, the [SIFTS](http://www.ebi.ac.uk/pdbe/docs/sifts/) project.

BioJava contains a parser for the SIFTS XML files. The [SiftsMappingProvider](http://www.biojava.org/docs/api/org/biojava/nbio/structure/io/sifts/SiftsMappingProvider.html) acts similarly to the AtomCache class that we [discussed earlier](caching.md), and can automatically download and locally install SIFTS files.

Here is how to request the mapping for one particular PDB ID:

```java
List
```
83 | C protein 84 | Segment: 1gc1_C_1_181 1 181 85 | SiftsResidue [pdbResNum=1, pdbResName=LYS, chainId=C, uniProtResName=K, uniProtPos=26, naturalPos=1, seqResName=LYS, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false] 86 | SiftsResidue [pdbResNum=2, pdbResName=LYS, chainId=C, uniProtResName=K, uniProtPos=27, naturalPos=2, seqResName=LYS, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false] 87 | SiftsResidue [pdbResNum=3, pdbResName=VAL, chainId=C, uniProtResName=V, uniProtPos=28, naturalPos=3, seqResName=VAL, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false] 88 | SiftsResidue [pdbResNum=4, pdbResName=VAL, chainId=C, uniProtResName=V, uniProtPos=29, naturalPos=4, seqResName=VAL, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false] 89 | SiftsResidue [pdbResNum=5, pdbResName=LEU, chainId=C, uniProtResName=L, uniProtPos=30, naturalPos=5, seqResName=LEU, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false] 90 | SiftsResidue [pdbResNum=6, pdbResName=GLY, chainId=C, uniProtResName=G, uniProtPos=31, naturalPos=6, seqResName=GLY, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false] 91 | SiftsResidue [pdbResNum=7, pdbResName=LYS, chainId=C, uniProtResName=K, uniProtPos=32, naturalPos=7, seqResName=LYS, pdbId=1gc1, uniProtAccessionId=P01730, notObserved=false] 92 | ... 93 |94 | 95 | As you can see for each residue in the Uniprot / PDB sequence the matching counterpart is provided (if there is one). 
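
Once such a listing is available, a common next step is to look up the UniProt position for a given PDB residue. The following is a minimal sketch of such a lookup table, assuming the values have already been read from the SIFTS residues. `PdbToUniProtIndex` and its methods are hypothetical names, not part of `SiftsMappingProvider` or any other BioJava class. The example values are the seven residues of chain C shown above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical lookup table for illustration only; not part of BioJava.
final class PdbToUniProtIndex {

    private final Map<String, Integer> positions = new HashMap<>();

    // The PDB residue number is kept as text, since it may carry an insertion code (e.g. "82A").
    private static String key(String chainId, String pdbResNum) {
        return chainId + ":" + pdbResNum;
    }

    void add(String chainId, String pdbResNum, int uniProtPos) {
        positions.put(key(chainId, pdbResNum), uniProtPos);
    }

    // Empty if the residue has no UniProt counterpart in the table.
    OptionalInt uniProtPosition(String chainId, String pdbResNum) {
        Integer pos = positions.get(key(chainId, pdbResNum));
        return pos == null ? OptionalInt.empty() : OptionalInt.of(pos);
    }

    // Values copied from the 1gc1 chain C listing above (pdbResNum 1..7 -> uniProtPos 26..32).
    static PdbToUniProtIndex exampleFrom1gc1() {
        int[] pdbResNums = {1, 2, 3, 4, 5, 6, 7};
        int[] uniProtPos = {26, 27, 28, 29, 30, 31, 32};
        PdbToUniProtIndex index = new PdbToUniProtIndex();
        for (int i = 0; i < pdbResNums.length; i++) {
            index.add("C", String.valueOf(pdbResNums[i]), uniProtPos[i]);
        }
        return index;
    }

    public static void main(String[] args) {
        PdbToUniProtIndex index = exampleFrom1gc1();
        System.out.println("C:3  -> " + index.uniProtPosition("C", "3"));
        System.out.println("C:99 -> " + index.uniProtPosition("C", "99"));
    }
}
```

Returning an `OptionalInt` makes the "if there is one" case explicit: residues without a counterpart yield an empty result instead of a misleading number.
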





---

Navigation:
[Home](../README.md)
| [Book 3: The Structure Modules](README.md)
| Chapter 7 : SEQRES and ATOM Records

Prev: [Chapter 6 : Work with mmCIF/PDBx Files](mmcif.md)

Next: [Chapter 8 : Structure Alignments](alignment.md)

--------------------------------------------------------------------------------
/structure/special.md:
--------------------------------------------------------------------------------
# Special Cases When Working with Protein Structures

## Alternate Locations

Some PDB entries contain alternate conformations for parts of a structure or a group. BioJava merges alternate conformations into a single group, for which alternative groups are available.

```java

Structure s = StructureIO.getStructure("1AAC");

Chain a = s.getChainByPDB("A");

Group g = a.getGroupByPDB( ResidueNumber.fromString("27"));

System.out.println(g);
for (Atom atom : g.getAtoms()) {
    System.out.print(atom.toPDB());
}


int pos = 0;
for (Group alt: g.getAltLocs()) {
    pos++;
    System.out.println("altLoc: " + pos + " " + alt);
    for (Atom atom : alt.getAtoms()) {
        System.out.print(atom.toPDB());
    }
}
```

## Insertion Codes

Insertion codes were introduced in the PDB when people wanted to compare the "same" protein between different species. As it turned out, the "same" protein did not show exactly the same sequence in different species, and in some cases insertions were found, resulting in longer sequences. For the comparison of the proteins it was considered important to preserve the numbering, so that one could say, for example, that "HIS 75" is an important residue. To make up for the mismatch in the lengths of the sequences, insertion codes were introduced.
As a consequence, in PDB, a particular residue is identified uniquely by three data items: chain identifier, residue number, and insertion code.

BioJava contains the ResidueNumber object to help with characterizing each group in a file. PDB ID 1IGY contains some extra residues around chain B position 82. BioJava can represent these like this:

```java
Structure s1 = StructureIO.getStructure("1IGY");

Chain b = s1.getChainByPDB("B");

for (Group g : b.getAtomGroups()){
    System.out.println(g.getResidueNumber() + " " + g.getPDBName() + " " + g.getResidueNumber().getInsCode());
}

```

This will display the following table (residue number, name, insertion code):

```
...
81 HIS null
82 LEU null
82A SER A
82B SER B
82C LEU C
83 THR null
84 SER null
...
```


## Chromophores

A [chromophore](http://en.wikipedia.org/wiki/Chromophore) is the part of a molecule responsible for its color. Some proteins, such as GFP, contain a chromophore that consists of three modified residues. BioJava represents this as a single group in terms of atoms, but as three amino acids when creating the amino acid sequences.

```java


// make sure we download chemical component definitions
// which is required for correctly representing the chromophore
FileParsingParameters params = new FileParsingParameters();
params.setLoadChemCompInfo(true);

// now register the parameters in the cache
AtomCache cache = new AtomCache();
cache.setFileParsingParams(params);
StructureIO.setAtomCache(cache);


// request a GFP protein
Structure s1 = StructureIO.getStructure("2pxw");

// and print out the internals
System.out.println(s1.getPDBHeader().toPDB());

// chromophore is at PDB residue number 66
for ( Chain c : s1.getChains()) {

    System.out.println("Chain " + c.getChainID() +
            " internal " + c.getInternalChainID() +
            " ligands " + c.getAtomLigands().size());
    System.out.println("        10        20        30        40        50        60");
    System.out.println("1234567890123456789012345678901234567890123456789012345678901234567890");
    System.out.println(c.getAtomSequence());

    int pos = 0 ;
    for (Group g: c.getAtomGroups()) {
        pos++;
        System.out.println(pos + " " + g.getResidueNumber() + " " + g.getPDBName() + " " + g.getType() + " " + g.getChemComp().getOne_letter_code() + " " + g.getChemComp().getType() );
    }
}
```

This will give the following output. Note 'DYG' starting at position 63.

```
  60
...01234567890
...AAFDYGNRVFTEY...
```

DYG is an unusual group: `getOne_letter_code()` returns 3 characters for it.

```
...
62 65 PHE amino F L-PEPTIDE LINKING
63 66 DYG amino DYG L-PEPTIDE LINKING
64 69 ASN amino N L-PEPTIDE LINKING
...
```

## Microheterogeneity





---

Navigation:
[Home](../README.md)
| [Book 3: The Structure Modules](README.md)
| Chapter 17 : Special Cases

Prev: [Chapter 15 : Protein Secondary Structure](secstruc.md)

Next: [Chapter 18 : Status Information](lists.md)

--------------------------------------------------------------------------------
/structure/structure-data-model.md:
--------------------------------------------------------------------------------
# The BioJava-Structure Data Model

A biologically and chemically meaningful data representation of PDB/mmCIF.

## The Basics

BioJava at its core is a collection of file parsers and (in some cases) data models to represent frequently used biological data. The protein-structure modules represent macromolecular data in a way that should make it easy to work with. The representation is essentially independent of the underlying file format, and the user can choose to work with either PDB or mmCIF files and still get an almost identical data representation. (There can be subtle differences between PDB and mmCIF data; for example, the atom indices in a few entries are not 100% identical.)

## The Main Hierarchy

BioJava provides a flexible data structure for managing protein structural data. The
[Structure](http://www.biojava.org/docs/api/org/biojava/nbio/structure/Structure.html) class is the main container.

A `Structure` has a hierarchy of sub-objects:

17 | Structure 18 | | 19 | Model(s) 20 | | 21 | Chain(s) 22 | | 23 | Group(s) -> Chemical Component Definition 24 | | 25 | Atom(s) 26 |27 | 28 | All `Structure` objects contain one or more `Models`. That means also X-ray structures contain a "virtual" model which serves as a container for the chains. This allows to represent multi-model X-ray structures, e.g. from time-series analysis. The most common way to access chains is via: 29 | 30 | ```java 31 | List
<pre>
Structure -> Entity -> Chain
</pre>

To explain this with an example, hemoglobin (e.g. PDB ID 4HHB) has two components, alpha
and beta. Each of the entities has two copies (= chains) in the structure. In 4HHB, alpha
has the two chains with the IDs A and C, and beta the chains B and D. In total, hemoglobin is
built up out of four chains.

This prints all the entities in a structure:

```java
Structure structure = StructureIO.getStructure("4hhb");

System.out.println(structure);

System.out.println(" # of compounds (entities) " + structure.getEntityInfos().size());

for ( EntityInfo entity: structure.getEntityInfos()) {
    System.out.println(" " + entity);
}
```









---

Navigation:
[Home](../README.md)
| [Book 3: The Structure Modules](README.md)
| Chapter 3 : Structure Data Model

Prev: [Chapter 2 : First Steps](firststeps.md)

Next: [Chapter 4 : Local Installations](caching.md)

--------------------------------------------------------------------------------
/structure/symmetry.md:
--------------------------------------------------------------------------------
Protein Symmetry using BioJava
================================================================

BioJava can be used to detect, analyze, and visualize **symmetry** and
**pseudo-symmetry** in the **quaternary** (biological assembly) and tertiary
(**internal**) structural levels of proteins.

## Quaternary Symmetry

The **quaternary symmetry** of a structure defines the relation and arrangement of the individual chains or groups of chains that are part of a biological assembly.
For a more exhaustive explanation of protein quaternary symmetry and its different types, visit the [PDB help page](http://www.rcsb.org/pdb/staticHelp.do?p=help/viewers/jmol_symmetry_view.html).

In the **quaternary symmetry** detection problem, we are given a set of chains (subunits) that are part of a biological assembly as input, defined by their atomic coordinates, and we are required to find the highest overall symmetry group that
relates them as output.
The solution is divided into the following steps:

1. First, we need to identify the chains that are identical (or similar
in the pseudo-symmetry case). For that purpose, we perform a pairwise alignment of all
chains and identify **clusters of identical or similar subunits**.
2. Next, we reduce each of the polypeptide chains to a single point, their **centroid** (center of mass).
3. Afterwards, we try different **symmetry operations** using a grid search to superimpose the chain centroids
and score them using the RMSD.
4. Finally, based on the parameters (cutoffs), we determine the **overall symmetry** of the
structure, with the symmetry relations obtained in the previous step.
5. In the case of an asymmetric structure, we combinatorially discard a number of chains and try
to detect any **local symmetries** present (symmetry that does not involve all subunits of the biological assembly).

The **quaternary symmetry** detection algorithm is implemented in the BioJava class
[QuatSymmetryDetector](http://www.biojava.org/docs/api/org/biojava/nbio/structure/symmetry/core/QuatSymmetryDetector).
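
Steps 2 and 3 of the list above can be illustrated with a small, self-contained sketch. This is not the `QuatSymmetryDetector` implementation: `CentroidSketch` and its methods are hypothetical names, the centroid is computed as an unweighted mean of coordinates (atom masses are ignored), and only a single candidate operation (a rotation about the z axis) is scored instead of a full grid search.

```java
// Hypothetical helper for illustration only; not part of BioJava.
final class CentroidSketch {

    // Step 2: reduce a chain to one point. Each row of coords is {x, y, z}.
    static double[] centroid(double[][] coords) {
        double[] c = new double[3];
        for (double[] p : coords) {
            c[0] += p[0];
            c[1] += p[1];
            c[2] += p[2];
        }
        for (int k = 0; k < 3; k++) {
            c[k] /= coords.length;
        }
        return c;
    }

    // A candidate symmetry operation: rotation about the z axis by the given angle.
    static double[][] rotateZ(double[][] points, double angleRad) {
        double cos = Math.cos(angleRad);
        double sin = Math.sin(angleRad);
        double[][] out = new double[points.length][3];
        for (int i = 0; i < points.length; i++) {
            out[i][0] = cos * points[i][0] - sin * points[i][1];
            out[i][1] = sin * points[i][0] + cos * points[i][1];
            out[i][2] = points[i][2];
        }
        return out;
    }

    // Step 3: score a candidate by the RMSD between two equally sized, already paired point sets.
    static double rmsd(double[][] a, double[][] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < 3; k++) {
                double d = a[i][k] - b[i][k];
                sum += d * d;
            }
        }
        return Math.sqrt(sum / a.length);
    }

    public static void main(String[] args) {
        // Two chain centroids placed symmetrically around the z axis (a C2-like arrangement).
        double[][] centroids = { {1, 0, 0}, {-1, 0, 0} };
        // After a valid C2 operation, chain 1 lands on chain 2 and vice versa.
        double[][] swapped = { centroids[1], centroids[0] };

        System.out.println("180 degrees: " + rmsd(rotateZ(centroids, Math.PI), swapped));
        System.out.println(" 90 degrees: " + rmsd(rotateZ(centroids, Math.PI / 2), swapped));
    }
}
```

A rotation that matches the arrangement gives an RMSD close to zero, while a wrong one gives a clearly larger value. The cutoffs in step 4 then decide which operations are accepted.
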
An example of how to use the detector programmatically is shown below:

```java
// First download the structure in the biological assembly form
Structure s;

// Set some parameters if needed different than DEFAULT - see descriptions
QuatSymmetryParameters parameters = new QuatSymmetryParameters();
SubunitClustererParameters clusterParams = new SubunitClustererParameters();

// Instantiate the detector
QuatSymmetryDetector detector = QuatSymmetryDetector(s, parameters, clusterParams);

// Static methods in QuatSymmetryDetector perform the calculation
QuatSymmetryResults globalResults = QuatSymmetryDetector.getGlobalSymmetry(s, parameters, clusterParams);
List
```