├── images ├── mview01.png ├── mview02.png ├── obabel01.png └── obabel02.png ├── Workshops └── EDirect Workshop_Scalfani_sp2021.pdf ├── ACS_sp2021_talk └── Scalfani_VF_EDirect_ACS_sp2021.pdf ├── LICENSE ├── README.md ├── 01_EDirect_Intro.md ├── 06_EDirect_Combining_Tools.md ├── 04_EDirect_PubChem_Recipes.md ├── 05_EDirect_PubMed_Recipes.md ├── 03_EDirect_PubChem_BioAssay_PubMed_Recipes.md └── 02_EDirect_Data_Fields_Structure.md /images/mview01.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/images/mview01.png -------------------------------------------------------------------------------- /images/mview02.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/images/mview02.png -------------------------------------------------------------------------------- /images/obabel01.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/images/obabel01.png -------------------------------------------------------------------------------- /images/obabel02.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/images/obabel02.png -------------------------------------------------------------------------------- /Workshops/EDirect Workshop_Scalfani_sp2021.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/Workshops/EDirect Workshop_Scalfani_sp2021.pdf -------------------------------------------------------------------------------- 
/ACS_sp2021_talk/Scalfani_VF_EDirect_ACS_sp2021.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/ACS_sp2021_talk/Scalfani_VF_EDirect_ACS_sp2021.pdf -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Vincent F. Scalfani 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # EDirectChemInfo 2 | 3 | **Notes** 4 | 5 | > Oct 21, 2024 - This repository has recently been transferred from The University of Alabama Libraries Web Services GitHub to The University of Alabama Libraries Research Data Services GitHub organization. 6 | > All GitHub related hyperlinks should automatically redirect to the new GitHub location, but if you notice anything that is not working correctly, please let us know. 7 | 8 | This repository contains Entrez Direct (EDirect, an NCBI tool) Unix scripts for programmatically obtaining data from various NCBI databases. Other EDirect resources and guides exist (referenced below). This EDirectChemInfo repository differs in that the focus is on teaching how to obtain chemical information, cheminformatics data, and chemical structure <--> bioassay <--> document relationship links. There are not many PubChem EDirect examples available, so hopefully this repository proves useful. I have also added some tips, step-wise directions, and code output examples to help you get started. 9 | 10 | Please note that this EDirectChemInfo repository is not affiliated with NCBI. You should contact [NCBI](https://www.ncbi.nlm.nih.gov/books/NBK179288/#_chapter6_For_More_Information_) for specific questions related to EDirect. This repository was created to accompany library instruction at The University of Alabama. With that in mind, please feel free to open a GitHub Issue or contact me directly with comments/questions if you think there is something I can help you with. In addition, if this repository has been a useful resource for you, please do let me know as this type of feedback can help prioritize my time. 
11 | 12 | Vincent Scalfani\ 13 | Science and Engineering Librarian\ 14 | The University of Alabama\ 15 | [UA Libraries Directory](https://www.lib.ua.edu/#/staffdir?liaison=1&search=scalfani) 16 | 17 | ## Contents 18 | 19 | * [What is EDirect?](https://github.com/vfscalfani/EDirectChemInfo/blob/master/01_EDirect_Intro.md) 20 | * [Installation Tips](https://github.com/vfscalfani/EDirectChemInfo/blob/master/01_EDirect_Intro.md#installation-tips) 21 | * [Usage Tips](https://github.com/vfscalfani/EDirectChemInfo/blob/master/01_EDirect_Intro.md#usage-tips) 22 | * [EDirect Function Help and Debug](https://github.com/vfscalfani/EDirectChemInfo/blob/master/01_EDirect_Intro.md#e-utility-application-help) 23 | * [Available Databases, Data Fields, and Data Structures](https://github.com/vfscalfani/EDirectChemInfo/blob/master/02_EDirect_Data_Fields_Structure.md) 24 | * [PubChem <--> PubChem BioAssay <--> PubMed EDirect Recipes](https://github.com/vfscalfani/EDirectChemInfo/blob/master/03_EDirect_PubChem_BioAssay_PubMed_Recipes.md) 25 | * [PubChem EDirect Recipes](https://github.com/vfscalfani/EDirectChemInfo/blob/master/04_EDirect_PubChem_Recipes.md) 26 | * [PubMed EDirect Recipes](https://github.com/vfscalfani/EDirectChemInfo/blob/master/05_EDirect_PubMed_Recipes.md) 27 | * [Combining EDirect Results with Chemical Depiction and Plotting](https://github.com/vfscalfani/EDirectChemInfo/blob/master/06_EDirect_Combining_Tools.md) 28 | 29 | ## References 30 | 31 | These are the main references I used to learn about NCBI E-Utilities, the EDirect syntax, Unix commands/scripts, and the importance of linked chemical data. Many thanks to the authors for their work. 32 | 33 | 1. [NCBI Documentation for Entrez Direct: E-utilities on the UNIX Command Line](https://www.ncbi.nlm.nih.gov/books/NBK179288/) 34 | 2. [NIH NLM The Insider's Guide to Accessing NLM Data](https://dataguide.nlm.nih.gov/) 35 | 3. [NCBI EDirect Cookbook](https://github.com/NCBI-Hackathons/EDirectCookbook) 36 | 4. 
[Computational Genomics Manual: NCBI EDirect](https://github.com/linsalrob/ComputationalGenomicsManual/blob/master/Databases/NCBI_Edirect.md) 37 | 5. [Entrez Link Descriptions](https://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html) 38 | 6. [Software Carpentry: The Unix Shell](https://swcarpentry.github.io/shell-novice/) 39 | 7. [Opening up connectivity between documents, structures and bioactivity by Christopher Southan](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7136548/) 40 | 41 | 42 | ## License Notes 43 | 44 | Code in this repository is licensed under the [MIT License](https://github.com/vfscalfani/EDirectChemInfo/blob/master/LICENSE). Some of the chemical depiction demonstrations from EDirect output use proprietary software, such as ChemAxon Marvin, which is not included under this license. Users must have valid licenses for any required proprietary software to run these portions of the code. 45 | 46 | Code output (e.g., reference/molecular data snippets) retrieved from NCBI via their EDirect utility is shown for code demonstration purposes only and is credited to NCBI and NLM. Please see the [NCBI Website and Data Usage Policies and Disclaimers](https://www.ncbi.nlm.nih.gov/home/about/policies/) for more information regarding the data. 47 | 48 | -------------------------------------------------------------------------------- /01_EDirect_Intro.md: -------------------------------------------------------------------------------- 1 | # What is EDirect? 2 | 3 | EDirect is a Unix command line tool from NCBI that allows programmatic retrieval of chemical/biological data and literature references from NCBI databases. EDirect reduces the barrier to accessing NCBI data programmatically; that is, with a basic knowledge of the Unix shell (e.g., bash), it is straightforward to obtain and format your own custom datasets, often with only a few lines of code. 
Moreover, you can pipe data retrieved from EDirect into other Unix tools for quick viewing and analysis ([Pipeline (Unix)](https://en.wikipedia.org/wiki/Pipeline_(Unix))). 4 | 5 | ## Installation Tips 6 | 7 | Follow the installation instructions from NCBI: [Entrez Direct: E-utilities on the UNIX Command Line](https://www.ncbi.nlm.nih.gov/books/NBK179288/). There are several different methods to install EDirect. I used option 3 (EDirect v14.4) with `wget` in Gnome Terminal on a Linux Ubuntu 18.04 workstation. If you are using Windows, NCBI mentions that you can use the Cygwin Unix emulator. Another option for Windows users is to set up a Linux virtual machine. There are many tutorials for setting up virtual machines. For example, here is one for installing [Ubuntu on VirtualBox](https://askubuntu.com/questions/142549/how-to-install-ubuntu-on-virtualbox). When installing EDirect in a virtual machine, you may need to customize the VirtualBox network settings in order to use the `curl` or `wget` EDirect installation methods. In my testing on an Ubuntu 20.04 virtual machine, the fourth installation option for EDirect (using the longer Perl script) worked fine with the standard VirtualBox network settings. 8 | 9 | ## Usage Tips 10 | 11 | NCBI has specific data usage policies and disclaimers: 12 | 13 | * [NCBI Website and Data Usage Policies and Disclaimers](https://www.ncbi.nlm.nih.gov/home/about/policies/) 14 | * [Entrez Programming Utilities Help](https://www.ncbi.nlm.nih.gov/books/NBK25501/) 15 | 16 | If you do not follow NCBI's usage policies (e.g., no more than 3 requests per second without an API key), NCBI may block your IP address. Be cautious and follow good programming practices of testing and adding sleep delays, particularly when executing multiple sequential calls in a loop. It is also a good idea to include your email address in each request so that NCBI can contact you if necessary. 
You can add your email address within each query like this: 17 | 18 | ```console 19 | 20 | user@computer:~$ e-function -email name@xx.edu -arg input 21 | 22 | ``` 23 | Replace `name@xx.edu` with your email address. The `e-function` is a placeholder for one of the actual EDirect functions like `einfo` or `esearch`, and `-arg input` is a placeholder for e-function argument(s) like `-db pccompound` or `-db pubmed -query "food allergies"`. 24 | 25 | ## EDirect Function Help 26 | 27 | I generally refer to the official [Entrez Programming Utilities Help Document](https://www.ncbi.nlm.nih.gov/books/NBK25501/) or the [NIH NLM E-Utilities Documentation](https://dataguide.nlm.nih.gov/eutilities/utilities.html); however, for a quick reference or reminder of the proper syntax, the `-help` option is useful. Here is an example with the `einfo` function: 28 | 29 | ```console 30 | 31 | user@computer:~$ einfo -help 32 | einfo 14.4 33 | 34 | Database Selection 35 | 36 | -dbs Print all database names 37 | -db Database name (or "all") 38 | 39 | Data Summaries 40 | 41 | -fields Print field names 42 | -links Print link names 43 | 44 | Field Example 45 | 46 | 47 | ALL 48 | All Fields 49 | All terms from all searchable fields 50 | 245340803 51 | N 52 | N 53 | N 54 | N 55 | N 56 | Y 57 | N 58 | 59 | 60 | Link Example 61 | 62 | 63 | pubmed_protein 64 | Protein Links 65 | Published protein sequences 66 | protein 67 | 68 | 69 | pubmed_protein_refseq 70 | Protein (RefSeq) Links 71 | Link to Protein RefSeqs 72 | protein 73 | 74 | 75 | ``` 76 | 77 | ## EDirect Query Translation via Debug Flag 78 | 79 | When experimenting with searches in EDirect, it is often helpful to view the interpreted query. 
This can be accomplished using the `-debug` flag in EDirect 14.4 (thanks to NLM Support for the explanation and tip!): 80 | 81 | ```console 82 | 83 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "hydrogel-based drug delivery" -debug 84 | nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ esearch.fcgi -retmax 0 -usehistory y -db pubmed -term "hydrogel-based drug delivery" -tool edirect -edirect 14.4 -edirect_os Linux -email name@xx.edu 85 | 86 | pubmed 87 | MCID... 88 | 1 89 | 436 90 | 1 91 | name@xx.edu 92 | Y 93 | 94 | 95 | ``` 96 | Next, copy and run the `nquire` command and pipe the results to `xtract`, extracting out the QueryTranslation element: 97 | 98 | ```console 99 | 100 | user@computer:~$ nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ esearch.fcgi -retmax 0 -usehistory y -db pubmed -term "hydrogel-based drug delivery" | xtract -pattern eSearchResult -element QueryTranslation 101 | hydrogel-based[All Fields] AND ("drug delivery systems"[MeSH Terms] OR ("drug"[All Fields] AND "delivery"[All Fields] AND "systems"[All Fields]) OR "drug delivery systems"[All Fields] OR ("drug"[All Fields] AND "delivery"[All Fields]) OR "drug delivery"[All Fields]) 102 | 103 | ``` 104 | 105 | -------------------------------------------------------------------------------- /06_EDirect_Combining_Tools.md: -------------------------------------------------------------------------------- 1 | # Combining EDirect Results with Chemical Depiction and Plotting 2 | 3 | **Notes** 4 | 5 | > 1. `user@computer:~$` represents an example terminal prompt name. Actual command/argument input is after the `$`. 6 | > 2. Replace `name@xx.edu` with your email address. 7 | > 3. `\` followed by `>` on the next line represents continued terminal input. You will need to delete the `>` symbol in order to run the scripts as a copy/paste into terminal. 8 | > 4. 
You should validate your own EDirect scripts and results as there may be unintentional mistakes in these recipes. A convenient method is to compare your EDirect results to the NCBI Web interface search results: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/). 9 | 10 | ## EDirect --> Chemical Depiction and Plots 11 | 12 | It is possible to pipe EDirect results into chemical structure viewers as some cheminformatics toolkits can read chemical file formats (e.g., SMILES) directly from standard input. 13 | 14 | ### ChemAxon MarvinView Chemical Depiction 15 | 16 | For [ChemAxon Marvin](https://chemaxon.com/products/marvin), we can pipe EDirect compiled SMILES directly into Marvin View (`mview`). Note that the `-` is the `mview` option to read structures from standard input. 17 | 18 | ```console 19 | 20 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "132427739"[UID] | \ 21 | > efetch -format docsum | \ 22 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey IUPACName | \ 23 | > mview - 24 | 25 | ``` 26 | 27 | ![mview01](/images/mview01.png) 28 | 29 | If you have multiple molecules to display, you can use the `mview` standard input option `-` along with the `gridbag` option to display the molecules in a matrix: 30 | 31 | 32 | ```console 33 | 34 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "132427739"[UID] | \ 35 | > elink -target pccompound -name pccompound_pccompound | \ 36 | > efetch -format docsum | \ 37 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey IUPACName | \ 38 | > mview --gridbag - 39 | 40 | ``` 41 | 42 | ![mview02](/images/mview02.png) 43 | 44 | _tested with ChemAxon MarvinView version 19.27.0._ 45 | 46 | ### Open Babel Chemical Depiction 47 | 48 | One really cool feature of using [Open Babel](https://github.com/openbabel) is the ability to display molecules as ASCII figures directly in the terminal. 
Below, we pipe the results to Open Babel using the standard input smiles format, `-ismi`, and then output in ascii format, `-oascii`. The `-xh 10` is a resizing option. 49 | 50 | ```console 51 | 52 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "13586"[UID] | \ 53 | > efetch -format docsum | \ 54 | > xtract -pattern DocumentSummary -element IsomericSmiles | \ 55 | > openbabel.obabel -ismi -oascii -xh 10 56 | __ 57 | __ \__ 58 | _/ \__ \_ 59 | __/ \_ \_ 60 | O ________/ \_ 61 | / 62 | / 63 | \ | 64 | \ / 65 | \/ 66 | 1 molecule converted 67 | ``` 68 | 69 | This works for multiple molecules too! 70 | 71 | ```console 72 | 73 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "132427739"[UID] | \ 74 | > elink -target pccompound -name pccompound_pccompound | \ 75 | > efetch -format docsum | \ 76 | > xtract -pattern DocumentSummary -element IsomericSmiles | \ 77 | > openbabel.obabel -ismi -oascii -xh 10 78 | \ 79 | ____ 80 | | 81 | __ O_Si__ 82 | O\ \__\ | 83 | |\N / \ 84 | __O| \__/ 85 | /__/ \ OH 86 | | \ 87 | /| / 88 | / 89 | / | 90 | \_ | O ____ 91 | \__O___ |____ 92 | __Si_O N \\/ \ 93 | \ /| / 94 | | \___ __N 95 | \__O ____ 96 | O / ___\/ \\ 97 | / /| / 98 | __ __ 99 | \_/ 100 | | / 101 | | | 102 | /____ O __Si__/ 103 | |/ \ | | | 104 | / \ O\\ ____ _| \ 105 | \ ___ | || / \_/ \__ 106 | \____/ \\_N \ / 107 | \ | _/ | / 108 | ___ 109 | / \/ 110 | || |\ 111 | \__ 112 | \_O ___ 113 | |_N/ \ / 114 | O|/ _\__\_/ 115 | / /_/ O_Si__ 116 | \_ 117 | \O | 118 | / 119 | _Si/ 120 | \ |_ 121 | O____O | \ 122 | __|Si_ | | ____ 123 | / \_\ __ | | 124 | / | / \____N__| 125 | \O| | / \_ 126 | \O/\___ \ 127 | \ | || 128 | \_ / 129 | ___ \ 130 | / \ O_Si_| 131 | |__\ \ \ \ | 132 | /\_O \__\ _\ 133 | __N / \/ \ 134 | \ / / | 135 | | O \__/ 136 | \ ___\ 137 | ____/ 138 | 139 | ... 
140 | 24 molecules converted 141 | ``` 142 | 143 | You can also save depictions as conventional PNG files using Open Babel, either for a single molecule: 144 | 145 | ```console 146 | 147 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "13586"[UID] | \ 148 | > efetch -format docsum | \ 149 | > xtract -pattern DocumentSummary -element IsomericSmiles CID | \ 150 | > openbabel.obabel -ismi -O 13586.png 151 | 1 molecule converted 152 | ``` 153 | 154 | ![obabel01](/images/obabel01.png) 155 | 156 | 157 | or multiple molecules in a matrix: 158 | 159 | ```console 160 | 161 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "132427739"[UID] | \ 162 | > elink -target pccompound -name pccompound_pccompound | \ 163 | > efetch -format docsum | \ 164 | > xtract -pattern DocumentSummary -element IsomericSmiles CID | \ 165 | > openbabel.obabel -ismi -O 132427739_similar.png -xp 1400 166 | 24 molecules converted 167 | ``` 168 | 169 | ![obabel02](/images/obabel02.png) 170 | 171 | _Tested with Open Babel v3.0.0 installed from Snap. I did receive a Font Configuration error when saving the PNG files; however, the conversion seemed to work fine._ 172 | 173 | ### gnuplot Data Plotting 174 | 175 | [gnuplot](http://www.gnuplot.info/) is a command-line graphing program that allows plotting data from standard input. In gnuplot, there is an option called "dumb terminal" that creates plots using ASCII characters directly in the terminal window, which is convenient for initial analysis of compiled EDirect data. 
For example, here is some data related to the number of *J Cheminform* articles indexed in PubMed by publication date: 176 | 177 | ```console 178 | 179 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Cheminform[JOUR]" | \ 180 | > efetch -format docsum | \ 181 | > xtract -pattern DocumentSummary -element PubDate | \ 182 | > cut -d " " -f 1 | \ 183 | > sort-uniq-count-rank | \ 184 | > sort -k2 185 | 22 2009 186 | 12 2010 187 | 54 2011 188 | 39 2012 189 | 52 2013 190 | 71 2014 191 | 78 2015 192 | 71 2016 193 | 67 2017 194 | 68 2018 195 | 56 2019 196 | ``` 197 | We can pipe this data directly to gnuplot. In the below script, `set term dumb` is the gnuplot option to create an ASCII plot, `-` sets the data input to standard input instead of a file, `using 2:1` sets the second column as the x-axis, and the first column as the y-axis, `with boxes` creates a box plot, and `notitle` removes the plot legend: 198 | 199 | ```console 200 | 201 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Cheminform[JOUR]" | \ 202 | > efetch -format docsum | \ 203 | > xtract -pattern DocumentSummary -element PubDate | \ 204 | > cut -d " " -f 1 | \ 205 | > sort-uniq-count-rank | \ 206 | > sort -k2 | \ 207 | > gnuplot -e "set term dumb; plot '-' using 2:1 with boxes notitle" 208 | 209 | 210 | 80 +---------------------------------------------------------------------+ 211 | | + + + ******* + + | 212 | | * * | 213 | 70 |-+ ******* ******* ******* +-| 214 | | * * * ****** * | 215 | | * * * * * * | 216 | 60 |-+ * * * * * * +-| 217 | | ****** * * * * * ******* | 218 | | * * ******* * * * * * * | 219 | 50 |-+ * * * * * * * * * *+-| 220 | | * * * * * * * * * * | 221 | 40 |-+ * * * * * * * * * *+-| 222 | | * ******* * * * * * * * | 223 | | * * * * * * * * * * | 224 | 30 |-+ * * * * * * * * * *+-| 225 | | * * * * * * * * * * | 226 | | * * * * * * * * * * | 227 | 20 |-+******* * * * * * * * * * *+-| 228 | | * * * * * * * * * * * * | 229 | | * ******* * + * * + * * + * 
* + * * | 230 | 10 +---------------------------------------------------------------------+ 231 | 2008 2010 2012 2014 2016 2018 2020 232 | 233 | ``` 234 | 235 | _Tested with gnuplot-x11 5.2.8._ 236 | 237 | 238 | -------------------------------------------------------------------------------- /04_EDirect_PubChem_Recipes.md: -------------------------------------------------------------------------------- 1 | # PubChem EDirect Recipes 2 | 3 | **Notes** 4 | 5 | > 1. `user@computer:~$` represents an example terminal prompt name. Actual command/argument input is after the `$`. 6 | > 2. Replace `name@xx.edu` with your email address. 7 | > 3. `\` followed by `>` on the next line represents continued terminal input. You will need to delete the `>` symbol in order to run the scripts as a copy/paste into terminal. 8 | > 4. You should validate your own EDirect scripts and results as there may be unintentional mistakes in these recipes. A convenient method is to compare your EDirect results to the NCBI Web interface search results: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/). 9 | 10 | ## PubChem EDirect 11 | 12 | ### Search PubChem Compound via InChIKey and Retrieve Data 13 | 14 | In the below script, we first query the PubChem Compound database (`pccompound`) for "NJTXJDYZPQNTSM-WMZOPIPTSA-N" in the InChIKey (`[IKEY]`) field. Next, the record is retrieved in XML docsum and several properties are extracted with the `xtract` function including the IsomericSmiles, CID, InChIKey, and IUPACName. 
15 | 16 | ```console 17 | 18 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "NJTXJDYZPQNTSM-WMZOPIPTSA-N"[IKEY] | \ 19 | > efetch -format docsum | \ 20 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey IUPACName 21 | C[C@]12CCC(=O)C=C1CCC[C@@H]2OC(=O)C3=CC=CC=C3 11044292 NJTXJDYZPQNTSM-WMZOPIPTSA-N [(1S,8aS)-8a-methyl-6-oxo-1,2,3,4,7,8-hexahydronaphthalen-1-yl] benzoate 22 | ``` 23 | _tested on 2021.01.27, EDirect 14.4, total count was 1._ 24 | 25 | ### Search PubChem Compound with a list of CIDs and Retrieve Data 26 | 27 | If we have a small list of PubChem Compound Identifiers (CIDs) and need to retrieve specific data for each CID, we can write a for loop directly in the terminal. Note that in the below Bash script, I added a sleep of one second within the loop in an effort to not overload the NCBI servers. 28 | 29 | ```console 30 | 31 | user@computer:~$ for myCID in \ 32 | > "146021325" \ 33 | > "11068043" \ 34 | > "11615487" \ 35 | > "10056179" \ 36 | > "169731" 37 | > do 38 | > esearch -email name@xx.edu -db pccompound -query "$myCID[UID]" | 39 | > efetch -format docsum | 40 | > xtract -pattern DocumentSummary -lbl "$myCID" -element IsomericSmiles InChIKey MolecularFormula MolecularWeight 41 | > sleep 1 42 | > done 43 | 146021325 CN1C(=CN=C1Cl)C(/C=C/C2=CC=CC=C2)(C3=CC=CC=C3)O AKSFJXCUMHAJKP-OUKQBFOZSA-N C19H17ClN2O 324.800 44 | 11068043 CC(C)[Si](C(C)C)(C(C)C)OC(CCCC1=CCC=CC1)CC=C JDKBJINLKCZQNX-UHFFFAOYSA-N C22H40OSi 348.600 45 | 11615487 CC1=CC(=C(C=C1)NC(=O)C(C)(C)C)OC WMXHBZHMNCGLQQ-UHFFFAOYSA-N C13H19NO2 221.290 46 | 10056179 CC(=O)N1CN2C3=CC=CC=C3C(=C2C4=CC=CC=C41)C(C(=O)NCC[Se]C5=CC=CC=C5)O MJWHOMSECISGAK-UHFFFAOYSA-N C27H25N3O3Se 518.500 47 | 169731 C1=CC=C2C(=C1)C=C(N2)CC#N RORMSTAFXZRNGK-UHFFFAOYSA-N C10H8N2 156.180 48 | ``` 49 | _tested on 2021.01.27, EDirect 14.4, total count was 5 (as expected in the for loop)._ 50 | 51 | ### Retrieve Pre-Computed Linked Similar Compounds 52 | 53 | In the below script, 
we use the `esearch` function to query the PubChem Compound database (`pccompound`) for CID 11044292 within the Compound ID field (`[uid]`). The `esearch` results are then piped to `elink` finding related PubChem Compounds via the Entrez link `pccompound_pccompound`. 54 | 55 | ```console 56 | 57 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "11044292"[UID] | \ 58 | > elink -target pccompound -name pccompound_pccompound | \ 59 | > efetch -format docsum | \ 60 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey IUPACName 61 | ... 62 | ... 63 | CC(C1=CC=CC=C1)OC(=O)C2=CC=C(C=C2)C(C)(C)C 152679150 ZNDWBMXQOLFTJJ-UHFFFAOYSA-N 1-phenylethyl 4-tert-butylbenzoate 64 | CCCCC1(CCC(C(C1)OC(=O)C2=CC=CC=C2)C(C)C)C 152242148 WDOHMQDXGOVKBZ-UHFFFAOYSA-N (5-butyl-5-methyl-2-propan-2-ylcyclohexyl) benzoate 65 | CCC1=CC=CC=C1C(=O)OC2C=CC(=O)CC2(C)C 150893175 KYLDSGZLUQKTGC-UHFFFAOYSA-N (6,6-dimethyl-4-oxocyclohex-2-en-1-yl) 2-ethylbenzoate 66 | CC1CCC(C(CC1=O)(C)C)OC(=O)C2=CC=CC=C2 150335011 GQMHTGLOWBLGSB-UHFFFAOYSA-N (2,2,5-trimethyl-4-oxocycloheptyl) benzoate 67 | ... 68 | ... 69 | ``` 70 | _tested on 2021.01.27, EDirect 14.4, total count was 238._ 71 | 72 | ### Find Compounds with Specific Attributes 73 | 74 | There are a variety of methods to limit results and find compounds with specific attributes in PubChem Compound. 
The below script, for example, uses the `efilter` function to limit the `elink` similarity results to compounds with active assays using the query "pccompound_pcassay_active" in the filter (`[FILT]`) field: 75 | 76 | ```console 77 | 78 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "11044292"[UID] | \ 79 | > elink -target pccompound -name pccompound_pccompound | \ 80 | > efilter -query "pccompound_pcassay_active"[FILT] | \ 81 | > efetch -format docsum | \ 82 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey IUPACName 83 | CC1(CC(=O)C=C(C1=O)C2=CC=C(C=C2)COC(=O)C3=CC=CC=C3)C 46904830 DSGHHJWVVKGFNI-UHFFFAOYSA-N [4-(5,5-dimethyl-3,6-dioxocyclohexen-1-yl)phenyl]methyl benzoate 84 | CC1=CC[C@H](/C(=C\[C@@H](C(CCC1)(C)C)OC(=O)C)/C)OC(=O)C2=CC=CC=C2 46886858 WZEJCPHMOSKQHH-FNXKFTHESA-N [(1R,2Z,4S)-4-acetyloxy-2,5,5,9-tetramethylcycloundeca-2,9-dien-1-yl] benzoate 85 | CC1=C2[C@H]([C@@H]([C@@]3(C=CC(=O)C(=C)[C@H]3C=C2CC1=O)C)OC(=O)C)OC(=O)C4=CC=CC=C4 44585423 DSOLMHLLLLRMOJ-BNWQNPBSSA-N [(4R,5R,5aR,9aS)-5-acetyloxy-3,5a-dimethyl-9-methylidene-2,8-dioxo-1,4,5,9a-tetrahydrobenzo[g]azulen-4-yl] benzoate 86 | CCOC(=O)C1=CC=CC(=C1)C2=CC(=O)CC(C2)(C)C 44143998 BBEWYQFSZQEXCH-UHFFFAOYSA-N ethyl 3-(5,5-dimethyl-3-oxocyclohexen-1-yl)benzoate 87 | CCC1=C(C(C(OC1=O)C2=CC=CC=C2)(C)C)OC(=O)C3=CC=CC=C3 2893657 FEWXNYDYEILDTL-UHFFFAOYSA-N (5-ethyl-3,3-dimethyl-6-oxo-2-phenyl-2H-pyran-4-yl) benzoate 88 | CC1=C(C(C(OC1=O)C2=CC=CC=C2)(C)C)OC(=O)C3=CC=CC=C3 569453 UXXMZHQXFIACTH-UHFFFAOYSA-N (3,3,5-trimethyl-6-oxo-2-phenyl-2H-pyran-4-yl) benzoate 89 | ``` 90 | 91 | _tested on 2021.01.27, EDirect 14.4, total count was 6._ 92 | 93 | Another filtering method could be to add a specific property attribute range, such as compounds containing 8 to 12 rotatable bonds (`[RBC]`): 94 | 95 | ```console 96 | 97 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "11044292"[UID] | \ 98 | > elink -target pccompound -name pccompound_pccompound | \ 
99 | > efilter -query "8:12"[RBC] | \ 100 | > efetch -format docsum | \ 101 | > xtract -pattern DocumentSummary -element IsomericSmiles CID RotatableBondCount 102 | CCC(CC)(CCCCCOC(=O)C1=CC=CC=C1)C(=O)C2=CC=CC=C2 153964625 12 103 | CCCCCCCC1CCC(CC1)OC(=O)C2=CC=CC=C2 153717993 9 104 | CCCCCCC1CCC(CC1)OC(=O)C2=CC=CC=C2 153717992 8 105 | CC(C(=O)CCCC(C)(C)CCC(C)(C)C)OC(=O)C1=CC=CC=C1 153334776 11 106 | COC(=O)CCCCC[C@H](C1=CC=CC=C1)OC(=O)C2=CC=CC=C2 145778504 11 107 | CCC(C(CC(C)(C)C)OC(=O)C1=CC=CC=C1)OC(=O)C2=CC=CC=C2 142273534 10 108 | ... 109 | ... 110 | ``` 111 | 112 | _tested on 2021.01.27, EDirect 14.4, total count was 57._ 113 | 114 | It is also possible to query PubChem Compound directly for compounds with specific attributes (i.e., without the use of `efilter`). However, you will likely need to be very specific in order to retrieve a reasonable number of records. For example, in the script below, PubChem Compound was queried for compounds containing uranium in the element field (`[ELMT]`) and 3 to 5 defined chiral atoms in the AtomChiralDefCount field (`[ACDC]`): 115 | 116 | 117 | ```console 118 | 119 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "(\"U\"[ELMT]) AND \"3\"[ACDC]:\"5\"[ACDC]" | \ 120 | > efetch -format docsum | \ 121 | > xtract -pattern DocumentSummary -element IsomericSmiles CID MolecularFormula AtomChiralDefCount 122 | C[C@@H]1CC[C@H]([C@@]2([C@]1(CCC(=C2)C)C)C)[C-]=C.O.[U] 154676185 C16H27OU- 4 123 | C[CH-]O[C@H]1[C@@H]([C@H]([C@@H](O[C@@H]1[CH2-])O)O)C.[U+2] 154572041 C9H16O4U 5 124 | [CH3-].CNCCO[C@@H]1CNC[C@@H](C1C2=CC=C(C=C2)O[C@H]3CCN(C3)C4=CC(=CC=C4)F)OCC5=CC6=C(C=C5)OCCN6CC[CH2-].[U+2] 154550507 C37H49FN4O4U 3 125 | C[C@H]1CC=C(CN1C)C2=CSC(=N2)SC3=C(N4[C-]([C@H]3C)[C@H](C4=O)[C@@H](C)O)C(=O)O.[U] 154536644 C20H24N3O4S2U- 4 126 | CC1CC2C3[C@H](C=C4C[C-](CC[C@@]4(C3CC[C@@]2(C15OCCO5)C)C)OC[CH2-])O.[U+2] 154528690 C24H36O4U 3 127 | 
COC1=C(C=C2C(=C1)C(=O)N3CC(=C)C[C@H]3[C@@H]([N-]2)O)OCCCCCOC4=C(C=C5C(=C4)N=C[C@@H]6CC(=C)CN6C5=O)OC.[U] 153695434 C33H37N4O7U- 3 128 | ... 129 | ... 130 | ``` 131 | 132 | _tested on 2021.01.27, EDirect 14.4, total count was 1938._ 133 | 134 | Note that I escaped (`\`) the internal quotes in the above query. Sometimes this is not necessary (in my experience it depends on the NCBI database). If you are unsure how the query is being interpreted, run the `esearch` function with the `-debug` option. You can then use `nquire` with the link output and extract out the parsed query: 135 | 136 | 137 | ```console 138 | 139 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "(\"U\"[ELMT]) AND \"3\"[ACDC]:\"5\"[ACDC]" -debug 140 | ... 141 | ... 142 | user@computer:~$ nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ esearch.fcgi -retmax 0 -usehistory y -db pccompound -term "(\"U\"[ELMT]) AND \"3\"[ACDC]:\"5\"[ACDC]" | \ 143 | > xtract -pattern eSearchResult -element QueryTranslation 144 | "U"[ELMT] AND "3"[AtomChiralDefCount] : "5"[AtomChiralDefCount] 145 | ``` 146 | 147 | ### Find Number of Compounds by Create Date 148 | 149 | We can use the Create Date field `[CDAT]` in PubChem Compound to search for compound records created on a specific date. The `esearch` results are then piped into `efetch` to retrieve the XML docsum compound records. Next, the `xtract` function is used to extract out the CreateDate. The extracted data is then piped into the EDirect alias function `sort-uniq-count-rank`, which sorts the data by highest frequency. Finally, I added an additional `sort` command, to sort by date (`-k2,2` for second column), instead of number of compounds. 
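The counting and re-sorting steps described above can be tried offline before any request is sent to NCBI. The following is a minimal sketch that assumes only the standard `sort`, `uniq`, and `awk` tools. The function `count_by_value` is a hypothetical stand-in written for illustration; it is not the EDirect `sort-uniq-count-rank` implementation, which may differ in details such as case handling. The three input lines are illustrative values in the CreateDate format, not real PubChem counts.

```shell
# count_by_value (hypothetical helper, not part of EDirect):
# count identical input lines and print "count<TAB>value", most frequent first.
count_by_value() {
  sort | uniq -c |
    awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n "\t" $0 }' |
    sort -k1,1nr
}

# Illustrative input only: CreateDate-style values, not real PubChem data.
printf '%s\n' \
  "2020/05/02 00:00" \
  "2020/05/01 00:00" \
  "2020/05/02 00:00" |
  count_by_value |
  sort -k2,2   # re-sort by the date column instead of by count
```

After the final `sort -k2,2`, the 2020/05/01 line comes first even though it has the lower count, which is the same reordering the recipe applies to the real output.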
150 | 151 | ```console 152 | 153 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "2020/05/01"[CDAT]:"2020/05/31"[CDAT] | \ 154 | > efetch -format docsum | \ 155 | > xtract -pattern DocumentSummary -element CreateDate | \ 156 | > sort-uniq-count-rank | \ 157 | > sort -k2,2 158 | 448 2020/05/01 00:00 159 | 20 2020/05/02 00:00 160 | 45 2020/05/04 00:00 161 | 3 2020/05/05 00:00 162 | 67 2020/05/06 00:00 163 | 32 2020/05/07 00:00 164 | 97 2020/05/08 00:00 165 | 7827 2020/05/11 00:00 166 | 42 2020/05/12 00:00 167 | 573 2020/05/13 00:00 168 | 67 2020/05/14 00:00 169 | 75 2020/05/15 00:00 170 | 136 2020/05/16 00:00 171 | 69 2020/05/18 00:00 172 | 52 2020/05/19 00:00 173 | 66 2020/05/20 00:00 174 | 53 2020/05/21 00:00 175 | 790 2020/05/22 00:00 176 | 36 2020/05/23 00:00 177 | 2 2020/05/24 00:00 178 | 5 2020/05/25 00:00 179 | 530 2020/05/26 00:00 180 | 63 2020/05/27 00:00 181 | 169 2020/05/28 00:00 182 | 9432 2020/05/29 00:00 183 | 26 2020/05/30 00:00 184 | 2 2020/05/31 00:00 185 | ``` 186 | _tested on 2021.01.27, EDirect 14.4._ 187 | 188 | If we want the number of compounds in PubChem by create date over a longer period of time (e.g., several months to years), it probably does not make sense to use `efetch`, as the number of compounds will be in the hundreds of thousands or even millions. Trying to download the docsums for this many records likely won't work. As an alternative, we can use `esearch` in a for loop and extract the Count value from the ENTREZ_DIRECT XML summary that `esearch` returns. 
For example, if we wanted the number of compounds created in PubChem for 2019 by month: 189 | 190 | ```console 191 | 192 | user@computer:~$ for date in \ 193 | > "2019/01" \ 194 | > "2019/02" \ 195 | > "2019/03" \ 196 | > "2019/04" \ 197 | > "2019/05" \ 198 | > "2019/06" \ 199 | > "2019/07" \ 200 | > "2019/08" \ 201 | > "2019/09" \ 202 | > "2019/10" \ 203 | > "2019/11" \ 204 | > "2019/12" 205 | > do 206 | > esearch -email name@xx.edu -db pccompound -query "$date[CDAT]" | 207 | > xtract -pattern ENTREZ_DIRECT -lbl "$date" -element Count 208 | > sleep 1 209 | > done 210 | 2019/01 1843612 211 | 2019/02 7970 212 | 2019/03 219313 213 | 2019/04 469125 214 | 2019/05 324068 215 | 2019/06 64691 216 | 2019/07 302326 217 | 2019/08 154938 218 | 2019/09 119817 219 | 2019/10 236148 220 | 2019/11 308727 221 | 2019/12 5444411 222 | ``` 223 | _tested on 2021.01.27, total count was 12 (as expected in the for loop)._ 224 | 225 | In the above for loop bash script, I added a sleep of one second between each `esearch` query in an effort to not overload the NCBI servers. 
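
The month list in the loop above can also be generated rather than typed out. The following is a minimal sketch, not part of the original recipe: the function names `months_of_year` and `count_by_month` are hypothetical helpers, and it assumes `seq` is available on your system. The `esearch` and `xtract` calls are the same ones used above.

```bash
# Print the twelve YYYY/MM strings for a given year, one per line.
months_of_year() {
  local year="$1" m
  for m in $(seq 1 12); do
    printf '%s/%02d\n' "$year" "$m"
  done
}

# Count pccompound records created in each month of a year.
# Requires EDirect and network access; it is only defined here, not run.
count_by_month() {
  local year="$1" email="$2" date
  for date in $(months_of_year "$year"); do
    esearch -email "$email" -db pccompound -query "${date}[CDAT]" |
      xtract -pattern ENTREZ_DIRECT -lbl "$date" -element Count
    sleep 1   # same one-second pause between requests as in the loop above
  done
}

# Show the first three generated date strings (no network needed).
months_of_year 2019 | head -n 3
# prints:
# 2019/01
# 2019/02
# 2019/03
```

To run the actual counts, call `count_by_month 2019 name@xx.edu` (with your own email address). A `for` loop over the generated values is used instead of `while read`, so that `esearch` cannot consume the remaining list from standard input.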
226 | 227 | ### Find Related PubChem Substances (same) 228 | 229 | To find the number of related PubChem substances for a PubChem compound, we can use `elink` with Entrez link `pccompound_pcsubstance_same`: 230 | 231 | ```console 232 | 233 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "14333"[UID] | \ 234 | > elink -target pcsubstance -name pccompound_pcsubstance_same | \ 235 | > xtract -pattern ENTREZ_DIRECT -element Count 236 | 51 237 | ``` 238 | 239 | And then to retrieve information about the PubChem substances, we can pipe these results into `efetch` and `xtract`, to extract out specific information such as the SID, CurrentSourceName, SourceID, and DepositDate: 240 | 241 | ```console 242 | 243 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "14333"[UID] | \ 244 | > elink -target pcsubstance -name pccompound_pcsubstance_same | \ 245 | > efetch -format docsum | \ 246 | > xtract -pattern DocumentSummary -element SID CurrentSourceName SourceID DepositDate 247 | ... 248 | 439452693 THE BioTek bt-308998 2020/12/31 00:00 249 | 438657242 Alfa Chemistry ACM1132394 2020/12/09 00:00 250 | 438538915 3WAY PHARM INC SWOT-0105728 2020/12/08 00:00 251 | 435642079 Chem-Space.com Database CSSB00032005459 2020/11/21 00:00 252 | 410573132 Google Patents 15237363 2020/08/12 00:00 253 | 404911410 The University of Alabama Libraries UALIB-1927 2020/03/21 00:00 254 | 403383863 PATENTSCOPE (WIPO) ORQWTLCYLDRDHK-UHFFFAOYSA-N 2020/01/24 00:00 255 | 387135315 NORMAN Suspect List Exchange ORQWTLCYLDRDHK-UHFFFAOYSA-N 2019/11/22 00:00 256 | 386279116 Wiley 140582 2019/10/23 00:00 257 | ... 258 | ... 259 | ``` 260 | _tested on 2021.01.27, total count was 51._ 261 | 262 | -------------------------------------------------------------------------------- /05_EDirect_PubMed_Recipes.md: -------------------------------------------------------------------------------- 1 | # PubMed EDirect Recipes 2 | 3 | **Notes** 4 | 5 | > 1. 
`user@computer:~$` represents an example terminal prompt name. Actual command/argument input is after the `$`. 6 | > 2. Replace `name@xx.edu` with your email address. 7 | > 3. `\` followed by `>` on the next line represents continued terminal input. You will need to delete the `>` symbol in order to run the scripts as a copy/paste into terminal. 8 | > 4. You should validate your own EDirect scripts and results as there may be unintentional mistakes in these recipes. A convenient method is to compare your EDirect results to the NCBI Web interface search results: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/). 9 | 10 | ## PubMed EDirect 11 | 12 | ### Search PubMed by Keyword and/or MeSH and Retrieve References 13 | 14 | We can use the EDirect function `esearch` to query PubMed. However, before trying to retrieve any of the results with `efetch`, it is a good idea to check that the record count is manageable (e.g., on the order of several thousand). In addition, see the [EDirect Query Translation Instructions](https://github.com/vfscalfani/EDirectChemInfo/blob/master/01_EDirect_Intro.md#edirect-query-translation-via-debug-flag) for how to use the `-debug` option to view how your query is interpreted in PubMed. 15 | 16 | ```console 17 | 18 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "hydrogel-based drug delivery" 19 | <ENTREZ_DIRECT> 20 | <Db>pubmed</Db> 21 | <WebEnv>MCID...</WebEnv> 22 | <QueryKey>1</QueryKey> 23 | <Count>436</Count> 24 | <Step>1</Step> 25 | <Email>name@xx.edu</Email> 26 | </ENTREZ_DIRECT> 27 | ``` 28 | 29 | Once the `esearch` query looks appropriate, we can start to pipe the `esearch` results into other EDirect functions. For example, the script below first uses `esearch` to query PubMed for "hydrogel-based drug delivery", and then these results are piped (`|`) into `efetch` to retrieve the results in XML format. 
The `efetch` results are then piped to the `xtract` function where several bibliographic elements of the PubMed XML records are extracted into a table: 30 | 31 | ```console 32 | 33 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "hydrogel-based drug delivery" | \ 34 | > efetch -format xml | \ 35 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \ 36 | > Author/Initials ArticleTitle ISOAbbreviation PubDate/Year Volume Issue MedlinePgn 37 | 33424262 El-Masry SM Hydrogel-based matrices for controlled drug delivery of etamsylate: Prediction of in-vivo plasma profiles. Saudi Pharm J 2020 28 12 1704-1718 38 | 33398321 Chen W Magnetically actuated intelligent hydrogel-based child-parent microrobots for targeted drug delivery. J Mater Chem B 2021 39 | 33396629 Dehshahri A New Horizons in Hydrogels for Methotrexate Delivery. Gels 2020 7 1 40 | 33387892 Amiri M Hydrogel beads-based nanocomposites in novel drug delivery platforms: Recent trends and developments. Adv Colloid Interface Sci 2020 288 102316 41 | 33378390 Kloepping KC Triphenylphosphonium derivatives disrupt metabolism and inhibit melanoma growth in vivo when delivered via a thermosensitive hydrogel. PLoS One 2020 15 12 e0244540 42 | 33359482 Agarwal P Structural characterization and developability assessment of sustained release hydrogels for rapid implementation during preclinical studies. Eur J Pharm Sci 2021 158 105689 43 | ... 44 | ... 
45 | ``` 46 | 47 | _tested on 2021.01.27, EDirect 14.4, total count was 436._ 48 | 49 | 50 | Note that if we want to extract out the DOIs, we can use the `xtract` `-block` option like this: 51 | 52 | ```console 53 | 54 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "hydrogel-based drug delivery" | \ 55 | > efetch -format xml | \ 56 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \ 57 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn \ 58 | > -block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId 59 | 33424262 El-Masry SM Saudi Pharm J 2020 28 12 1704-1718 https://doi.org/10.1016%2Fj.jsps.2020.10.016 60 | 33398321 Chen W J Mater Chem B 2021 https://doi.org/10.1039%2Fd0tb02384a 61 | 33396629 Dehshahri A Gels 2020 7 1 https://doi.org/10.3390%2Fgels7010002 62 | 33387892 Amiri M Adv Colloid Interface Sci 2020 288 102316 https://doi.org/10.1016%2Fj.cis.2020.102316 63 | 33378390 Kloepping KC PLoS One 2020 15 12 e0244540 https://doi.org/10.1371%2Fjournal.pone.0244540 64 | 33359482 Agarwal P Eur J Pharm Sci 2021 158 105689 https://doi.org/10.1016%2Fj.ejps.2020.105689 65 | ... 66 | ... 67 | ``` 68 | _tested on 2021.01.27, EDirect 14.4, total count was 436._ 69 | 70 | 71 | There is a lot going on with the last line of code that extracts out the DOIs: `-block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId`. Let's look at part of a PubMed XML file to help interpret what is going on here: 72 | 73 | ```console 74 | ... 75 | ... 76 | <ArticleIdList> 77 | <ArticleId IdType="pubmed">17630804</ArticleId> 78 | <ArticleId IdType="doi">10.1021/jo071035l</ArticleId> 79 | </ArticleIdList> 80 | ... 81 | ... 82 | ``` 83 | 84 | The `-block` option limits the extraction to a particular section of the XML, in this case the `ArticleId` tags. The `@` selects an attribute of the element, here `IdType`, and the `-if ... -equals doi` condition keeps only the `ArticleId` whose `IdType` is `doi`. Finally, `-doi` is an `xtract` string option that prefixes https://doi.org/ to the extracted DOI. 
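
In the output above, the links produced by `-doi` contain percent-encoded characters (for example `%2F` in place of `/`). If you prefer the plain DOI form, a small post-processing step can help. This is only a sketch under stated assumptions: `decode_doi_links` is a hypothetical helper name, and it handles just the three encodings that appear in these recipes' output (`%2F`, `%28`, `%29`), not percent-encoding in general.

```bash
# Replace the percent-encoded characters seen in the recipe output
# with their literal forms: %2F -> /, %28 -> (, %29 -> )
decode_doi_links() {
  sed -e 's|%2F|/|g' -e 's|%28|(|g' -e 's|%29|)|g'
}

# Example with one link copied from the output above (no network needed).
echo 'https://doi.org/10.1016%2Fj.jsps.2020.10.016' | decode_doi_links
```

The function reads standard input, so it can be appended to the end of an `xtract` pipeline with `| decode_doi_links`. It rewrites every column of each line, which is fine as long as no other extracted field contains these sequences.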
There is a more thorough explanation of `-block` and extracting out the DOIs with the `-block` option in the [NLM Insider's Guide to Accessing NLM Data Part 4](https://dataguide.nlm.nih.gov/classes/edirect-for-pubmed/samplecode4.html#output-a-list-of-pmids-and-corresponding-dois) and [Entrez Programming Utilities Help Manual](https://www.ncbi.nlm.nih.gov/books/NBK179288/). 85 | 86 | 87 | Similarly to the above script, we can specify particular fields to query within PubMed. The below script searches for "ionic liquids" in the MeSH term field (`[MESH]`) and "imidazolium" in all fields. Note that the internal quotes are escaped (`\`), which is sometimes necessary for the query to be interpreted correctly when using phrases. 88 | 89 | 90 | ```console 91 | 92 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "\"ionic liquids\"[MESH] AND imidazolium" | \ 93 | > efetch -format xml | \ 94 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \ 95 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn \ 96 | > -block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId 97 | 33396149 Hu LX Ecotoxicol Environ Saf 2021 208 111629 https://doi.org/10.1016%2Fj.ecoenv.2020.111629 98 | 33346267 Kaur M Phys Chem Chem Phys 2021 23 1 320-328 https://doi.org/10.1039%2Fd0cp04513f 99 | 33253998 Tashakkori P J Chromatogr A 2021 1635 461741 https://doi.org/10.1016%2Fj.chroma.2020.461741 100 | 33142384 Ren YM Zhonghua Lao Dong Wei Sheng Zhi Ye Bing Za Zhi 2020 38 10 767-769 https://doi.org/10.3760%2Fcma.j.cn121094-20191010-00483 101 | 33135708 Kumar S Phys Chem Chem Phys 2020 22 43 25255-25263 https://doi.org/10.1039%2Fd0cp04014b 102 | 32822985 Zuo L J Chromatogr A 2020 1628 461446 https://doi.org/10.1016%2Fj.chroma.2020.461446 103 | 32711338 Zunita M Bioresour Technol 2020 315 123864 https://doi.org/10.1016%2Fj.biortech.2020.123864 104 | ... 105 | ... 
106 | ``` 107 | _tested on 2021.01.27, EDirect 14.4, total count was 1000._ 108 | 109 | 110 | ### Calculate the Most Frequent Journal Titles for a PubMed Search 111 | 112 | The below script uses `esearch` to query PubMed for "Artificial Intelligence" in the `[MESH]` field and "drug discovery" in the `[ALL]` field. The records are then retrieved in XML format using the `efetch` function, followed by extracting out the journal names (`ISOAbbreviation`) using `xtract`. The `xtract` results are then piped to the EDirect alias function `sort-uniq-count-rank`, which sorts the data by highest frequency: 113 | 114 | ```console 115 | 116 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "\"Artificial Intelligence\"[MESH] AND \"drug discovery\"[ALL]" | \ 117 | > efetch -format xml | \ 118 | > xtract -pattern PubmedArticle -element ISOAbbreviation | \ 119 | > sort-uniq-count-rank 120 | 169 J Chem Inf Model 121 | 53 BMC Bioinformatics 122 | 49 PLoS One 123 | 40 Bioinformatics 124 | 39 Methods Mol Biol 125 | 33 Mol Pharm 126 | 32 Molecules 127 | 29 Sci Rep 128 | 28 Drug Discov Today 129 | 28 J Comput Aided Mol Des 130 | 24 J Med Chem 131 | 23 Expert Opin Drug Discov 132 | 23 Int J Mol Sci 133 | 19 Curr Top Med Chem 134 | 18 Mol Inform 135 | 17 Future Med Chem 136 | 16 Nucleic Acids Res 137 | 15 Nature 138 | 15 PLoS Comput Biol 139 | 14 IEEE/ACM Trans Comput Biol Bioinform 140 | ... 141 | ... 142 | ``` 143 | _tested on 2021.01.27, EDirect 14.4._ 144 | 145 | ### Calculate the Frequency of Author Publications for a University Department in PubMed 146 | 147 | The below script uses `esearch` to query PubMed for ("university of alabama" AND tuscaloosa) in the affiliation field (`[AFFL]`). Tuscaloosa was added to limit the number of retrieved records associated with The University of Alabama at Birmingham and The University of Alabama at Huntsville. 
Another approach could have been to use the NOT operator: `"(university of alabama[AFFL]) NOT (birmingham[AFFL] OR huntsville[AFFL])"`. However, the latter approach may eliminate any collaborative articles with these institutions (affiliation searches are challenging!). Next, the results were retrieved as XML using `efetch`, followed by piping these results to `xtract` to extract out the publication year (`PubDate/Year`) and sort by frequency with `sort-uniq-count-rank`. Note that a conditional statement was used in the `xtract` pattern to only extract results from articles if the affiliation contains both `chemistry` and `tuscaloosa`. The thought here was that this would limit the results (mostly) to author publications from The University of Alabama (Tuscaloosa) Department of Chemistry: 148 | 149 | ```console 150 | 151 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "(university of alabama[AFFL] AND tuscaloosa[AFFL])" | \ 152 | > efetch -format xml | \ 153 | > xtract -pattern PubmedArticle -if Affiliation -contains chemistry -and Affiliation -contains tuscaloosa -element PubDate/Year | \ 154 | > sort-uniq-count-rank 155 | 65 2015 156 | 64 2020 157 | 59 2017 158 | 53 2018 159 | 49 2019 160 | 45 2016 161 | 41 2014 162 | 35 2013 163 | 30 2012 164 | 28 2008 165 | 26 2006 166 | 23 2007 167 | 23 2010 168 | 20 2004 169 | 19 2003 170 | 18 2009 171 | 17 1999 172 | 17 2001 173 | ... 174 | ... 
175 | ``` 176 | 177 | _tested on 2021.01.27, EDirect 14.4._ 178 | 179 | If instead we want to know individual Author numbers in PubMed instead of total publications by year, we can change the `xtract` pattern: 180 | 181 | ```console 182 | 183 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "(university of alabama[AFFL] AND tuscaloosa[AFFL])" | \ 184 | > efetch -format xml | \ 185 | > xtract -pattern Author -if Affiliation -contains chemistry -and Affiliation -contains tuscaloosa -element LastName Initials | \ 186 | > sort-uniq-count-rank 187 | 108 Dixon DA 188 | 53 Vasiliu M 189 | 34 Vincent JB 190 | 32 Rogers RD 191 | 22 Bowman MK 192 | 20 Fang Z 193 | 14 Grant DJ 194 | 14 Thanthiriwatte KS 195 | 13 Cassady CJ 196 | 12 Frantom PA 197 | 11 Chen M 198 | 11 Kelley SP 199 | 11 Shamshina JL 200 | 10 Kispert LD 201 | 10 Metzger RM 202 | 10 Papish ET 203 | 9 Gerlach DL 204 | 9 Li S 205 | 9 Timkovich R 206 | 8 Matus MH 207 | ... 208 | ... 209 | ``` 210 | _tested on 2021.01.27, EDirect 14.4._ 211 | 212 | 213 | Let's take a closer look at the conditional `xtract` pattern specifying to extract data only if the affiliation contains chemistry and tuscaloosa: 214 | 215 | ```console 216 | 217 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "(university of alabama[AFFL] AND tuscaloosa[AFFL])" | \ 218 | > efetch -format xml | \ 219 | > xtract -pattern Author -if Affiliation -contains chemistry -and Affiliation -contains tuscaloosa -element LastName Initials Affiliation 220 | Rowe SJ Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA. 221 | Mecaskey RJ Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA. 222 | Nasef M Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA. 223 | Talton RC Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA. 
224 | Sharkey RE Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA. 225 | Halliday JC Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA. 226 | ... 227 | ... 228 | ``` 229 | _tested on 2021.01.27, EDirect 14.4._ 230 | 231 | 232 | With a quick look at the ~1000 results, it seemed like we extracted out the intended data; however, I did notice some false positive results. One example was an article from The University of Alabama (Tuscaloosa) Department of Biological Sciences with an external collaborator having "Chemistry" in the institution name. Other errors could come from what we unintentionally excluded, such as any records that do not have Tuscaloosa in the affiliation field (i.e., only a partial address or zip code). These types of affiliation searches are tricky, so test often and think through the results carefully. 233 | 234 | 235 | ### Retrieve Cites and Cited References in PubMed 236 | 237 | The `elink` function can retrieve associated cites and cited references for PubMed records. Cites are the references listed in the article itself (i.e., its bibliography), and cited are the later articles that reference the article. Not all PubMed articles have associated citation reference data. The available reference data are from the [NIH Open Citation Collection Dataset](https://pubmed.ncbi.nlm.nih.gov/31600197/). 
238 | 239 | To retrieve the number of cites for a PubMed article, we can use the `elink` function, followed by `xtract` to extract out the Count element: 240 | 241 | ```console 242 | 243 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "29978703[PMID]" | \ 244 | > elink -cites | \ 245 | > xtract -pattern ENTREZ_DIRECT -element Count 246 | 11 247 | ``` 248 | _tested on 2021.01.27, EDirect 14.4._ 249 | 250 | 251 | Add `efetch` to your script if you want to retrieve the records: 252 | 253 | ```console 254 | 255 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "29978703[PMID]" | \ 256 | > elink -cites | \ 257 | > efetch -format xml | \ 258 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \ 259 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn \ 260 | > -block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId 261 | 29382051 Stefanachi A Molecules 2018 23 2 https://doi.org/10.3390%2Fmolecules23020250 262 | 27709885 Stempel E Acc Chem Res 2016 49 11 2390-2402 https://doi.org/10.1021%2Facs.accounts.6b00265 263 | 26661053 James MJ Chemistry 2016 22 9 2856-81 https://doi.org/10.1002%2Fchem.201503835 264 | 26313158 Liu BY Org Lett 2015 17 17 4380-3 https://doi.org/10.1021%2Facs.orglett.5b02230 265 | 22969063 Han X Angew Chem Int Ed Engl 2012 51 41 10390-3 https://doi.org/10.1002%2Fanie.201205238 266 | 18620434 Martin R Acc Chem Res 2008 41 11 1461-73 https://doi.org/10.1021%2Far800036s 267 | ... 268 | ... 
269 | ``` 270 | 271 | _tested on 2021.01.27, EDirect 14.4._ 272 | 273 | Getting the cited records only requires changing `-cites` to `-cited`: 274 | 275 | ```console 276 | 277 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "29978703[PMID]" | \ 278 | > elink -cited | \ 279 | > efetch -format xml | \ 280 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \ 281 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn \ 282 | > -block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId 283 | 32537619 Fernandes RA Chem Commun (Camb) 2020 56 61 8569-8590 https://doi.org/10.1039%2Fd0cc02659j 284 | 32317969 Lautié E Front Pharmacol 2020 11 397 https://doi.org/10.3389%2Ffphar.2020.00397 285 | 30707497 Ivanova OA Chem Rec 2019 https://doi.org/10.1002%2Ftcr.201800166 286 | 30259622 Tymann D Angew Chem Int Ed Engl 2018 57 47 15553-15557 https://doi.org/10.1002%2Fanie.201808578 287 | ``` 288 | _tested on 2021.01.27, EDirect 14.4._ 289 | 290 | We can answer some interesting questions with the NIH Open Citation Collection Data. For example, I noticed that the PubMed XML records for articles in *J Cheminform* contain a reference list for articles in PubMed. So, theoretically, if we query PubMed for *J Cheminform*, extract out all of the references, and sort these by frequency, we should get the most cited references in *J Cheminform* article bibliographies (caveat: in the available PubMed citation data). 
291 | 292 | In the below script, the `xtract` pattern creates a new line for each extracted reference citation PMID from the ArticleId with pubmed attribute field: 293 | 294 | ```console 295 | 296 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Cheminform[JOUR]" | \ 297 | > efetch -format xml | \ 298 | > xtract -pattern Reference -if ArticleId@IdType -equals pubmed -element ArticleId | \ 299 | > sort-uniq-count-rank | \ 300 | > head -n 20 301 | 90 20426451 302 | 83 21982300 303 | 80 21948594 304 | 65 12653513 305 | 47 11259830 306 | 40 10592235 307 | 38 16796559 308 | 38 27899562 309 | 37 26400175 310 | 36 8709122 311 | 33 17154509 312 | 33 19498078 313 | 30 16381955 314 | 29 21059682 315 | 29 23343401 316 | 28 15667143 317 | 28 24214965 318 | 27 17932057 319 | 27 21425294 320 | 27 22587354 321 | ``` 322 | _tested on 2021.01.27, EDirect 14.4._ 323 | 324 | Note that when quickly viewing all of the sorted results (~10,000 lines), I did see maybe a 100 or so entries with two PMIDs per line or a DOI and a PMID. Since we specifically defined the pubmed IdType attribute, it is not exactly clear to me yet why there would be some extra data in there. Perhaps it is a mistake or inconsistency in the *J Cheminform* PubMed XML records. 
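
One way to set aside the irregular lines noted above is to keep only lines that consist of a single all-digit PMID before counting. The sketch below is an assumption-laden illustration, not the original author's method: `keep_single_pmids` and `irregular_lines` are hypothetical helper names, the sample input lines are made up for the demonstration, and plain `sort | uniq -c | sort -rn` stands in for `sort-uniq-count-rank` so that the example runs without EDirect.

```bash
# Keep only lines that are exactly one all-digit PMID.
keep_single_pmids() {
  grep -E '^[0-9]+$'
}

# Show the lines that would be dropped, so they can be inspected by hand.
irregular_lines() {
  grep -Ev '^[0-9]+$'
}

# Made-up sample: two clean PMIDs (one repeated), one line with a PMID plus
# a DOI, and one line with two tab-separated PMIDs.
printf '20426451\n21982300 10.1186/1758-2946-3-33\n20426451\n12653513\t11259830\n' |
  keep_single_pmids | sort | uniq -c | sort -rn
```

In the recipe above, `keep_single_pmids` would go between `xtract` and `sort-uniq-count-rank`. Dropping the irregular lines undercounts any PMIDs that appear only on those lines, so it is worth reviewing the output of `irregular_lines` before relying on the ranking.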
325 | 326 | 327 | We can take a quick look at the top 10 cited references using a for loop: 328 | 329 | ```console 330 | user@computer:~$ for refs in \ 331 | > "20426451" \ 332 | > "21982300" \ 333 | > "21948594" \ 334 | > "12653513" \ 335 | > "11259830" \ 336 | > "10592235" \ 337 | > "16796559" \ 338 | > "27899562" \ 339 | > "26400175" \ 340 | > "8709122" 341 | > do 342 | > esearch -email name@xx.edu -db pubmed -query "$refs[PMID]" | 343 | > efetch -format xml | 344 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \ 345 | > Author/Initials ArticleTitle ISOAbbreviation PubDate/Year Volume Issue MedlinePgn \ 346 | > -block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId 347 | > sleep 1 348 | > done 349 | 20426451 Rogers D Extended-connectivity fingerprints. J Chem Inf Model 2010 50 5 742-54 https://doi.org/10.1021%2Fci100050t 350 | 21982300 O'Boyle NM Open Babel: An open chemical toolbox. J Cheminform 2011 3 33 https://doi.org/10.1186%2F1758-2946-3-33 351 | 21948594 Gaulton A ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 2012 40 Database issue D1100-7 https://doi.org/10.1093%2Fnar%2Fgkr777 352 | 12653513 Steinbeck C The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics. J Chem Inf Comput Sci 43 2 493-500 https://doi.org/10.1021%2Fci025584y 353 | 11259830 Lipinski CA Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 2001 46 1-3 3-26 https://doi.org/10.1016%2Fs0169-409x%2800%2900129-0 354 | 10592235 Berman HM The Protein Data Bank. 
Nucleic Acids Res 2000 28 1 235-42 https://doi.org/10.1093%2Fnar%2F28.1.235 355 | 16796559 Steinbeck C Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics. Curr Pharm Des 2006 12 17 2111-20 https://doi.org/10.2174%2F138161206777585274 356 | 27899562 Gaulton A The ChEMBL database in 2017. Nucleic Acids Res 2017 45 D1 D945-D954 https://doi.org/10.1093%2Fnar%2Fgkw1074 357 | 26400175 Kim S PubChem Substance and Compound databases. Nucleic Acids Res 2016 44 D1 D1202-13 https://doi.org/10.1093%2Fnar%2Fgkv951 358 | 8709122 Bemis GW The properties of known drugs. 1. Molecular frameworks. J Med Chem 1996 39 15 2887-93 https://doi.org/10.1021%2Fjm9602928 359 | ``` 360 | 361 | Another interesting question is which journal is cited most often in *J Cheminform* articles (in the available PubMed citation data). In the below script, we take a similar approach to above, but instead of extracting out the PMIDs, we extract out the Citation element. The line `cut -d "." -f 1` deletes everything after the first period, which usually leaves only the journal abbreviation (e.g., "Drug Discov Today. 2006 Dec;11(23-24):1046-53" becomes "Drug Discov Today"). 362 | ```console 363 | 364 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Cheminform[JOUR]" | \ 365 | > efetch -format xml | \ 366 | > xtract -pattern Reference -element Citation | \ 367 | > cut -d "." -f 1 | \ 368 | > sort-uniq-count-rank 369 | 2274 J Chem Inf Model 370 | 1268 J Cheminform 371 | 1167 Nucleic Acids Res 372 | 930 J Med Chem 373 | 620 Bioinformatics 374 | 576 J Chem Inf Comput Sci 375 | 425 J Comput Aided Mol Des 376 | 381 BMC Bioinformatics 377 | 351 Drug Discov Today 378 | 252 Mol Inform 379 | 247 J Comput Chem 380 | 236 PLoS One 381 | 222 Nature 382 | 215 Nat Rev Drug Discov 383 | 202 Proc Natl Acad Sci U S A 384 | 181 Science 385 | 148 Anal Chem 386 | 147 Proteins 387 | 140 J Mol Graph Model 388 | ... 389 | ... 
390 | ``` 391 | _tested on 2021.01.27, EDirect 14.4._ 392 | 393 | Note that the citation formats in these results are also inconsistent and would need to be evaluated and cleaned up for a more thorough analysis. For example, some of the extracted Citations included author names and article titles, so the `cut` command deleting everything after the first `.` does not suffice for those data entries. 394 | 395 | 396 | ### Number of Records in PubMed by Create Date 397 | 398 | Here is an interesting script to retrieve the count of PubMed records by create date (`[CRDT]`) for each of the first six months of 2020. Since more than 100,000 records are added to PubMed every month, a strategy using `efetch` likely would not work (i.e., trying to retrieve 500,000+ records would take a long time). 399 | 400 | ```console 401 | 402 | user@computer:~$ for date in \ 403 | > "2020/01" \ 404 | > "2020/02" \ 405 | > "2020/03" \ 406 | > "2020/04" \ 407 | > "2020/05" \ 408 | > "2020/06" 409 | > do 410 | > esearch -email name@xx.edu -db pubmed -query "$date[CRDT]" | 411 | > xtract -pattern ENTREZ_DIRECT -lbl "$date" -element Count 412 | > sleep 1 413 | > done 414 | 2020/01 108863 415 | 2020/02 107561 416 | 2020/03 106386 417 | 2020/04 124575 418 | 2020/05 121324 419 | 2020/06 124664 420 | ``` 421 | _tested on 2021.01.27, EDirect 14.4._ 422 | 423 | 424 | 425 | ### Number of Records in PubMed that are Also Freely Available in PubMed Central 426 | 427 | Let's say we wanted to know how many articles in *J Chem Inf Model* (indexed in PubMed) are available freely in PubMed Central. 
We can first get a count for *J Chem Inf Model* records in PubMed by querying PubMed in the Journal field (`[JOUR]`), followed by retrieving the records, extracting out the PubDate, and then sorting by frequency: 428 | 429 | ```console 430 | 431 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Chem Inf Model[JOUR]" | \ 432 | > efetch -format docsum | \ 433 | > xtract -pattern DocumentSummary -element PubDate | \ 434 | > cut -d " " -f 1 | \ 435 | > sort-uniq-count-rank | \ 436 | > sort -k2,2 437 | 216 2005 438 | 280 2006 439 | 246 2007 440 | 225 2008 441 | 268 2009 442 | 203 2010 443 | 297 2011 444 | 306 2012 445 | 300 2013 446 | 309 2014 447 | 247 2015 448 | 232 2016 449 | 283 2017 450 | 237 2018 451 | 490 2019 452 | 612 2020 453 | 65 2021 454 | ``` 455 | _tested on 2021.01.27, EDirect 14.4._ 456 | 457 | 458 | In the above script, the line `cut -d " " -f 1` deletes any data appearing after the year and `sort -k2,2` sorts the data by the second column. Next, we can add `elink` into our script to find the linked records in PubMed Central (`pmc`) from the Entrez link `pubmed_pmc`: 459 | 460 | ```console 461 | 462 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Chem Inf Model[JOUR]" | \ 463 | > elink -target pmc -name pubmed_pmc | \ 464 | > efetch -format docsum | \ 465 | > xtract -pattern DocumentSummary -element PubDate | \ 466 | > cut -d " " -f 1 | \ 467 | > sort-uniq-count-rank | \ 468 | > sort -k2,2 469 | 1 2005 470 | 3 2006 471 | 7 2007 472 | 14 2008 473 | 31 2009 474 | 26 2010 475 | 62 2011 476 | 38 2012 477 | 55 2013 478 | 59 2014 479 | 38 2015 480 | 32 2016 481 | 39 2017 482 | 43 2018 483 | 57 2019 484 | 51 2020 485 | ``` 486 | 487 | _tested on 2021.01.27, EDirect 14.4._ 488 | 489 | Note that if you have a query returning tens of thousands of results, you would likely want to use a strategy without `efetch`, such as adding a date into your `esearch` query, followed by extracting out the count element from the XML. 
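
The count-only strategy mentioned above could look something like the following. This is a rough sketch rather than a tested recipe: `journal_year_query` and `pmc_counts_by_year` are hypothetical helper names, and the year is restricted with the PubMed publication date field (`[PDAT]`), which is an assumption on my part. Because the earlier tables were built from the docsum PubDate values (and the second one from PubMed Central docsums), the numbers may not match them exactly.

```bash
# Build a PubMed query that limits a journal to one publication year.
journal_year_query() {
  printf '%s[JOUR] AND %s[PDAT]\n' "$1" "$2"
}

# For each year, print the year and the number of linked PubMed Central
# records, without using efetch. Requires EDirect and network access;
# the function is only defined here, not run.
pmc_counts_by_year() {
  local journal="$1" email="$2" year
  shift 2
  for year in "$@"; do
    esearch -email "$email" -db pubmed \
      -query "$(journal_year_query "$journal" "$year")" |
      elink -target pmc -name pubmed_pmc |
      xtract -pattern ENTREZ_DIRECT -lbl "$year" -element Count
    sleep 1   # pause between requests, as in the earlier loops
  done
}

# Show one generated query string (no network needed).
journal_year_query "J Chem Inf Model" 2020
```

A call such as `pmc_counts_by_year "J Chem Inf Model" name@xx.edu 2018 2019 2020` would then run one `esearch` and `elink` per year and print one count per line.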
490 | 491 | 492 | -------------------------------------------------------------------------------- /03_EDirect_PubChem_BioAssay_PubMed_Recipes.md: -------------------------------------------------------------------------------- 1 | # PubChem <--> PubChem BioAssay <--> PubMed EDirect Recipes 2 | 3 | **Notes** 4 | 5 | > 1. `user@computer:~$` represents an example terminal prompt name. Actual command/argument input is after the `$`. 6 | > 2. Replace `name@xx.edu` with your email address. 7 | > 3. `\` followed by `>` on the next line represents continued terminal input. You will need to delete the `>` symbol in order to run the scripts as a copy/paste into terminal. 8 | > 4. You should validate your own EDirect scripts and results as there may be unintentional mistakes in these recipes. A convenient method is to compare your EDirect results to the NCBI Web interface search results: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/). 9 | 10 | ## EDirect PubChem Entrez Links 11 | 12 | ### PubChem Compound --> PubMed Citations 13 | **Description:** Search for a CID in the PubChem Compound Database and retrieve related PubMed linked references. 14 | 15 | In the below script, we use the `esearch` function to query the PubChem Compound database (`pccompound`) for CID 174076 within the Compound ID field, `[uid]`. The `esearch` results are then piped to `elink` finding related PubMed citations via the Entrez link `pccompound_pubmed`. Finally, we retrieve the results with `efetch` in XML format and extract out some bibliographic reference information using the `xtract` function. 
16 | 17 | ```console 18 | 19 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 174076[uid] | \ 20 | > elink -target pubmed -name pccompound_pubmed | \ 21 | > efetch -format xml | \ 22 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \ 23 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn 24 | 22957575 Gabl S J Chem Phys 2012 137 9 094501 25 | 22868451 Zhang Y Phys Chem Chem Phys 2012 14 35 12157-64 26 | 22859056 Malberg F Phys Chem Chem Phys 2012 14 35 12079-82 27 | 22852554 Zhang Y J Phys Chem B 2012 116 33 10036-48 28 | 22662183 Zhang BB PLoS ONE 2012 7 5 e37641 29 | ... 30 | ``` 31 | _tested on 2021.01.26, EDirect 14.4, total count was 102._ 32 | 33 | ### PubChem Compound --> PubMed Citations (with filtering) 34 | **Description:** Search for CID in PubChem Compound Database, find related PubMed citations, then only retrieve references from a specific journal. 35 | 36 | We can filter `elink` results with `efilter` to only include PubMed citations (Entrez linked via `pccompound_pubmed`) to the CID but also matching a specific PubMed query. 
For example, if we are only interested in linked _Phys Chem Chem Phys_ references to CID 174076, we can use the journal field `[JOUR]` in an `efilter` query: 37 | 38 | ```console 39 | 40 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 174076[uid] | \ 41 | > elink -target pubmed -name pccompound_pubmed | \ 42 | > efilter -query "Phys Chem Chem Phys"[JOUR] | \ 43 | > efetch -format xml | \ 44 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \ 45 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn 46 | 22868451 Zhang Y Phys Chem Chem Phys 2012 14 35 12157-64 47 | 22859056 Malberg F Phys Chem Chem Phys 2012 14 35 12079-82 48 | 22451012 Sillars FB Phys Chem Chem Phys 2012 14 17 6094-100 49 | 21643581 Pensado AS Phys Chem Chem Phys 2011 13 30 13518-26 50 | 21643580 Schröder C Phys Chem Chem Phys 2011 13 26 12240-8 51 | ... 52 | ``` 53 | _tested on 2021.01.26, EDirect 14.4, total count was 11._ 54 | 55 | ### PubChem Compound --> PubMed MeSH (with filtering) 56 | **Description:** Search for a CID in PubChem Compound, find related PubMed records via MeSH, and retrieve only references that contain the MeSH subheading "chemical synthesis". 57 | 58 | This is my favorite literature search: start with a PubChem CID and then find PubMed literature related to its synthesis. Similarly to the search above, we can filter out references using an `efilter` query for 'chemical synthesis' as a MeSH subheading `[SUBH]`. Note that we used the `pccompound_pubmed_mesh` Entrez link as the `elink` target name here. 
59 | 60 | ```console 61 | 62 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 94257[uid] | \ 63 | > elink -target pubmed -name pccompound_pubmed_mesh | \ 64 | > efilter -query "chemical synthesis"[SUBH] | \ 65 | > efetch -format xml | \ 66 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID ArticleTitle \ 67 | > ISOAbbreviation PubDate/Year 68 | 28463562 Enantioselective Chemical Syntheses of the Furanosteroids (-)-Viridin and (-)-Viridiol. J. Am. Chem. Soc. 2017 69 | 23040731 Viridin analogs derived from steroidal building blocks. Bioorg. Med. Chem. Lett. 2012 70 | 22849426 Synthetic studies on furanosteroids: construction of the viridin core structure via Diels-Alder/retro-Diels-Alder and vinylogous Mukaiyama aldol-type reaction. J. Org. Chem. 2012 71 | 19644878 Abrogation of antibody-induced arthritis in mice by a self-activating viridin prodrug and association with impaired neutrophil and endothelial cell function. Arthritis Rheum. 2009 72 | 19572524 Pentacyclic furanosteroids: the synthesis of potential kinase inhibitors related to viridin and wortmannolone. J. Org. Chem. 2009 73 | ... 74 | ``` 75 | _tested on 2021.01.26, EDirect 14.4, total count was 8._ 76 | 77 | 78 | ### PubChem Compound --> PubMed Citations OR PubMed MeSH 79 | **Description:** Search for a CID in PubChem Compound, find related PubMed citations and related PubMed citations via MeSH. 80 | 81 | It appears that you can combine `elink` queries, with either the same Entrez link or a different Entrez link, but within the same database. 
For example, if we want to retrieve PubMed literature related to PubChem CID 174076 for both the `pccompound_pubmed` and `pccompound_pubmed_mesh` Entrez links in one dataset, we combine two separate `elink` queries with an OR operator: 82 | 83 | ```console 84 | 85 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 174076[uid] | \ 86 | > elink -target pubmed -name pccompound_pubmed -label pubmed_cit | \ 87 | > esearch -email name@xx.edu -db pccompound -query 174076[uid] | \ 88 | > elink -target pubmed -name pccompound_pubmed_mesh -label pubmed_mesh_cit | \ 89 | > esearch -query "(#pubmed_cit) OR (#pubmed_mesh_cit)" | \ 90 | > efetch -format xml | \ 91 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \ 92 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn 93 | 32231037 Babicka M Molecules 2020 25 7 94 | 31931064 Love SA Int J Biol Macromol 2020 147 569-575 95 | 31818016 Wang F Int J Mol Sci 2019 20 24 96 | 31814059 Weber AL Orig Life Evol Biosph 2019 49 4 199-211 97 | 31675504 Gomez-Herrero E Ecotoxicol Environ Saf 2020 187 109836 98 | 31520950 Pal S Ecotoxicol Environ Saf 2019 184 109634 99 | ... 100 | ``` 101 | _tested on 2021.01.26, EDirect 14.4, total count was 317._ 102 | 103 | 104 | ### PubChem Substance --> PubChem Compound --> PubMed Publisher 105 | **Description:** Search for a PubChem Substance Data Source Depositor, find related same PubChem Compounds, and then retrieve related PubMed references linked via publisher. 106 | 107 | In the below script, we first search the PubChem Substance (`pcsubstance`) database using `esearch` for the data source depositor _Nature Communications_. We can use the Current Source Name `[CSN]` field for this query. Note that an underscore is put in place of the space in the query. This syntax is important for searching in PubChem with the EDirect `esearch` function. 
After `esearch`, we pipe the results into `elink` twice, first finding related PubChem Compounds via the `pcsubstance_pccompound_same` Entrez link, and then using this new result list to find related PubMed publisher deposited citations from the `pccompound_pubmed_publisher` Entrez link. Finally, similarly to previous searches, we use a combination of `efetch` and `xtract` to retrieve selected data: 108 | 109 | ```console 110 | 111 | user@computer:~$ esearch -email name@xx.edu -db pcsubstance -query "nature_communications"[CSN] | \ 112 | > elink -target pccompound -name pcsubstance_pccompound_same | \ 113 | > elink -target pubmed -name pccompound_pubmed_publisher | \ 114 | > efetch -format xml | 115 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName Author/Initials \ 116 | > ISOAbbreviation PubDate/Year Volume Issue MedlinePgn 117 | 26673265 Gilbert ZW Nat Chem 2016 8 1 63-8 118 | 25424885 Yan T Nat Commun 2014 5 5602 119 | 25422853 Vaidya AB Nat Commun 2014 5 5521 120 | 25382411 Dommerholt J Nat Commun 2014 5 5378 121 | 25382259 Wang B Nat Commun 2014 5 5354 122 | ... 123 | ``` 124 | 125 | _tested on 2021.01.26, EDirect 14.4, total count was 101._ 126 | 127 | 128 | ### PubChem Substance --> PubChem Compound <--> PubMed Publisher 129 | **Description:** Search for a PubChem Substance Data Source Depositor, find related same PubChem Compounds, and then retrieve related PubMed PMIDs linked via publisher. 130 | 131 | Building upon the previous search, if needed, it is possible to obtain individual relationships of the CIDs to PubMed IDs (CID <--> PMID). 
We can do this using the `-cmd neighbor` option in `elink`: 132 | 133 | ```console 134 | 135 | user@computer:~$ esearch -email name@xx.edu -db pcsubstance -query "nature_communications"[CSN] | \ 136 | > elink -target pccompound -name pcsubstance_pccompound_same | \ 137 | > elink -target pubmed -name pccompound_pubmed_publisher -cmd neighbor | \ 138 | > xtract -pattern LinkSet -element Id 139 | 146033657 24398593 140 | 136286496 141 | 136264969 24177669 142 | 136264968 24177669 143 | 136262920 23385592 144 | 136262919 23385592 145 | 136247006 24457545 146 | 136247005 24457545 147 | 136247004 24457545 148 | 136247003 24457545 149 | 136219971 150 | 135922679 22027590 151 | 91868204 23764831 152 | ... 153 | ``` 154 | _tested on 2021.01.26, EDirect 14.4, total count was 1594 (returns all CIDs, not all have linked PMIDs)._ 155 | 156 | The first column contains the PubChem CIDs and the second column contains the linked PMIDs. Additional linked PMIDs are placed in subsequent columns when available. 157 | 158 | 159 | ### PubChem Compound --> PubChem BioAssay 160 | **Description:** Search for a PubChem CID in PubChem Compound, then retrieve related PubChem active BioAssay data. 161 | 162 | To retrieve BioAssay results labeled as 'Active' that are linked to a CID, we can use the `elink` function with the PubChem BioAssay (`pcassay`) database via Entrez link `pccompound_pcassay_active`. This is followed by `efetch` and `xtract`. 
In this particular example, we extracted the AID, CurrentSourceName, AssayName, ActiveSidCount, and TargetCount: 163 | 164 | ```console 165 | 166 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "6303"[uid] | \ 167 | > elink -target pcassay -name pccompound_pcassay_active | \ 168 | > efetch -format docsum | \ 169 | > xtract -pattern DocumentSummary -element AID CurrentSourceName AssayName ActiveSidCount TargetCount 170 | 1255098 ChEMBL Inhibition of TLR4-mediated NF-kappaB signaling pathway in BALB/c mouse RAW264.7 cells assessed as suppression of LPS-stimulated PGE2 production at 1 to 10 ug/ml preincubated for 1 hr followed by LPS challenge measured after 6 hrs by immunoblot analysis 1 1 171 | 1255092 ChEMBL Inhibition of TLR4-mediated NF-kappaB signaling pathway in BALB/c mouse RAW264.7 cells assessed as suppression of LPS-stimulated TNF-alpha production at 1 to 10 ug/ml preincubated for 1 hr followed by LPS challenge measured after 6 hrs by immunoblot analysis 1 1 172 | 751324 ChEMBL Inhibition of NFkappaB p65 nuclear translocation in mouse RAW264.7 cells after 24 hrs by DAPI staining-based laser confocal immunofluorescent microscopic analysis 1 1 173 | 174 | ... 175 | ``` 176 | _tested on 2021.01.26, EDirect 14.4, total count was 47._ 177 | 178 | 179 | ### PubChem Compound <--> PubChem BioAssay 180 | **Description:** Search for a PubChem CID in PubChem Compound Database, find related compounds with same connectivity, then retrieve related AIDs for each CID. 181 | 182 | It is possible to obtain individual relationships of the CIDs to BioAssay AIDs (CID <--> AID). We can do this using the `-cmd neighbor` option in `elink`. Note that we first found related compounds with same connectivity using the Entrez link `pccompound_pccompound_sameconnectivity_pulldown`. This step was followed by the `pccompound_pcassay_active` Entrez link in the PubChem BioAssay database to retrieve AID links to the CIDs. We used the 'Active' assay links here. 
There are also other Entrez PubChem Compound assay links such as inactive, `pccompound_pcassay_inactive`. 183 | 184 | ```console 185 | 186 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "6303"[uid] | \ 187 | > elink -target pccompound -name pccompound_pccompound_sameconnectivity_pulldown | \ 188 | > elink -target pcassay -name pccompound_pcassay_active -cmd neighbor | \ 189 | > xtract -pattern LinkSet -element Id 190 | ... 191 | ... 192 | 6335098 193 | 688425 194 | 451875 1347103 1296009 2551 2546 195 | 248010 687016 652245 651719 177 175 173 171 169 167 165 163 161 159 157 155 153 149 147 196 | 6303 1346987 1259407 1255098 1255092 1207585 1207584 1207579 1207578 1207577 1207576 1167619 1159565 1159562 1159559 1159557 1065715 1065714 1065710 1065706 1065697 1065696 1065695 1065713 1065705 1065699 751324 686979 686978 651820 652245 651719 602346 602250 588511 493002 463218 463212 416870 416743 216185 86858 81069 32353 32352 31719 31718 2467 197 | ``` 198 | _tested on 2021.01.26, EDirect 14.4, total count was 24 CIDs (not all have associated AIDs)._ 199 | 200 | 201 | ## EDirect PubMed Entrez Links 202 | 203 | ### PubMed --> PubChem Compound 204 | **Description:** Search for a PubMed article ID (PMID), then retrieve related PubChem Compounds. 205 | 206 | In the below script, we first use `esearch` to query PubMed for the article ID 29407984 in the `[PMID]` field. This result is then piped into `elink` to retrieve linked compounds in the PubChem Compound database (`pubmed_pccompound`). In this case, there was one compound and we used `efetch` to retrieve the CID record as docsum XML, followed by `xtract` to extract the IsomericSmiles, CID, and InChIKey values.
207 | 208 | ```console 209 | 210 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "29407984"[PMID] | \ 211 | > elink -target pccompound -name pubmed_pccompound | \ 212 | > efetch -format docsum | \ 213 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey 214 | C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)O 2764 MYSWGUAQZAJSOK-UHFFFAOYSA-N 215 | 216 | ``` 217 | _tested on 2021.01.26, EDirect 14.4, total count was 1._ 218 | 219 | 220 | ### PubMed --> PubChem Compound (+ mixtures) 221 | **Description:** Search for a PubMed article ID (PMID), then retrieve linked PubChem Compound mixtures/components. 222 | 223 | In this script, an additional `elink` search is added to find related PubChem Mixture/Component compounds via Entrez link `pccompound_pccompound_mixture`. 224 | 225 | ```console 226 | 227 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "29407984"[PMID] | \ 228 | > elink -target pccompound -name pubmed_pccompound | \ 229 | > elink -target pccompound -name pccompound_pccompound_mixture | \ 230 | > efetch -format docsum | xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey 231 | ... 232 | ... 233 | C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)O.C(CO)N(CCO)CCO 154963193 NGBBVVPJSFAHHI-UHFFFAOYSA-N 234 | C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)O.C(=O)(C(=O)O)O.[Na] 154963186 HNZYVRRDOWOQIT-UHFFFAOYSA-N 235 | CNC.C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)O 154963184 NBGZCMHVBXSHSN-UHFFFAOYSA-N 236 | C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)[OH2+] 153275427 MYSWGUAQZAJSOK-UHFFFAOYSA-O 237 | C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)[O-] 152748405 MYSWGUAQZAJSOK-UHFFFAOYSA-M 238 | ... 239 | ... 240 | ``` 241 | _tested on 2021.01.26, EDirect 14.4, total count was 375._ 242 | 243 | 244 | 245 | ### PubMed --> PubChem Compound (MESH search) 246 | **Description:** Search PubMed with a text query, then retrieve linked PubChem Compounds. 
247 | 248 | We can also perform text queries in PubMed and retrieve linked PubChem Compounds. Note that in the below script we searched for "ionic liquids" in the `[MESH]` field and Imidazolium in any field. Since this query requires two pairs of quotes, we have to escape the internal quotes in order for the query to be interpreted correctly. The Entrez link `pubmed_pccompound` was used to find related PubChem compounds. 249 | 250 | ```console 251 | 252 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "\"ionic liquids\"[MESH] AND imidazolium" | \ 253 | > elink -target pccompound -name pubmed_pccompound | \ 254 | > efetch -format docsum | \ 255 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey 256 | C1=CC2=CC(=C(C(=C2C(=O)C(=C1)O)O)O)O 135403797 WDGFFVCWBZVLCE-UHFFFAOYSA-N 257 | C1=NC2=C(N1[C@H]3[C@@H]([C@@H]([C@H](O3)CO)O)O)N=C(NC2=O)N 135398635 NYHBQMYGNKIUIF-UUOKFMHZSA-N 258 | CCCCCCCCN1C=C[N+](=C1C2=[N+](C=CN2CCCC)C)C 123995430 DRJFJBHYMOHPHX-UHFFFAOYSA-N 259 | CC(=O)OC1=[N+](C=CN1CC=C)C 123614562 XSXMFLUARQMOLS-UHFFFAOYSA-N 260 | C[N+]1=C(N(C=C1)CCCCCCCCCCCCS)OC(=O)OC2=[N+](C=CN2CCCCCCCCCCCCS)C 123431445 DRJOSAVMFCYCSU-UHFFFAOYSA-P 261 | ... 262 | ``` 263 | _tested on 2021.01.27, EDirect 14.4, total count was 395._ 264 | 265 | ### PubMed --> PubChem Compound (MESH search, and a PubChem filter) 266 | **Description:** Search PubMed with a text query and retrieve only linked compounds containing defined chiral atoms. 267 | 268 | We can also perform some powerful filtering with `efilter`. In the below script, the `[ACDC]` field is the defined atom chiral count in PubChem. A range of 1 through 100 was added for this `[ACDC]` filter. Since it is unlikely that any of the compounds would have near 100 chiral atoms, we can be fairly confident this should capture most, if not all, cases in our search. 
269 | 270 | ```console 271 | 272 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "\"ionic liquids\"[MESH] AND imidazolium" | \ 273 | > elink -target pccompound -name pubmed_pccompound | \ 274 | > efilter -query "1:100"[ACDC] | \ 275 | > efetch -format docsum | \ 276 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey 277 | C1=NC2=C(N1[C@H]3[C@@H]([C@@H]([C@H](O3)CO)O)O)N=C(NC2=O)N 135398635 NYHBQMYGNKIUIF-UUOKFMHZSA-N 278 | C([C@H](C(=O)[C@H](CO)O)O)O 54067296 WXYXERHRDKEISL-ZXZARUISSA-N 279 | B(O)(O)OCC(=O)[C@H]([C@@H]([C@@H](CO)O)O)O 53705729 BXAZSOZNRVUIGN-UYFOZJQFSA-N 280 | C([C@H]1[C@@H]([C@H]([C@@H]([C@@H](O1)O[C@@H]2[C@@H](O[C@H]([C@@H]([C@H]2O)O)O)CO)O)O)O)O 46936190 GUBGYTABKSRVRQ-AEDSEYDFSA-N 281 | C1[C@H](OC2=CC(=CC(=C2C1=O)O)OC3C(C(C(C(O3)CO)O)O)O)C4=CC=C(C=C4)O 42607902 DLIKSSGEMUFQOK-CEFFZDIVSA-N 282 | ... 283 | ``` 284 | _tested on 2021.01.27, EDirect 14.4, total count was 43._ 285 | 286 | 287 | ### PubMed --> PubChem Compounds + PubChem Compounds (MeSH) + PubChem Compounds (Publisher) 288 | **Description:** Search PubMed, then find linked PubChem Compounds, PubChem Compounds via PubMed MeSH, and PubChem Compound PubMed Publisher. 289 | 290 | As seen in the previous PubChem searches, there are several Entrez links from PubMed to PubChem Compound such as `pubmed_pccompound`, `pubmed_pccompound_mesh`, and `pubmed_pccompound_publisher`. 
We can retrieve associated compounds from all three at the same time like this: 291 | 292 | ```console 293 | 294 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "imidazolium AND bacteria" | \ 295 | > elink -target pccompound -name pubmed_pccompound -label compounds_01 | \ 296 | > esearch -email name@xx.edu -db pubmed -query "imidazolium AND bacteria" | \ 297 | > elink -target pccompound -name pubmed_pccompound_mesh -label compounds_02 | \ 298 | > esearch -email name@xx.edu -db pubmed -query "imidazolium AND bacteria" | \ 299 | > elink -target pccompound -name pubmed_pccompound_publisher -label compounds_03 | \ 300 | > esearch -query "(#compounds_01) OR (#compounds_02) OR (#compounds_03)" | \ 301 | > efetch -format docsum | \ 302 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey 303 | C[Si](C)(O)O[Si](C)(C)O.C[Si](CCC1=CC=CC=C1)(O)O[Si](C)(CCC2=CC=CC=C2)O 155288862 HYTMPCHVCUAIOW-UHFFFAOYSA-N 304 | CC(C1CCC(C(O1)OC2C(CC(C(C2O)OC3C(C([C@@](CO3)(C)O)NC)O)N)N)N)NC 146157093 CEAZRRDELHUEMR-NWNXOGAHSA-N 305 | C1=CC=C2C(=C1)C=C(N2)CC3=CC=C(C=C3)C(F)(F)F.C1=CC=C2C(=C1)C=C(N2)CC3=CC=C(C=C3)C(F)(F)F 139191468 NNXWVRXROOQQNO-UHFFFAOYSA-N 306 | CC[C@@H]1[C@@]2([C@@H]([C@H](C(=O)[C@@H](C[C@@]([C@@H]([C@H](C(=O)[C@H](C(=O)O1)C)C)O[C@@H]3[C@@H]([C@H](C[C@H](O3)C)N(C)C)O)(C)OC)C)C)N(C(=O)O2)CCCCN4C=C(N=C4)C5=CN=CC=C5)C 138402871 LJVAJPDWBABPEJ-WMGYHEQLSA-N 307 | CCOC(=O)/C(=N\NC1=CC=CC2=C1N=CC=C2)/C3=[N+](C=CN3)C 136199795 ZHOWQGCGVQGEKD-UHFFFAOYSA-O 308 | ... 309 | ... 310 | ``` 311 | _tested on 2021.01.27, EDirect 14.4, total count was 568._ 312 | 313 | 314 | ### PubMed <--> PubChem Compound 315 | **Description:** Search PubMed for an affiliation, find related PubChem Compounds, then retrieve related CIDs for each PMID. 
316 | 317 | If we want to retrieve the PMID <--> CID relationships (for Entrez link `pubmed_pccompound`), we can achieve this using the `-cmd neighbor` option in `elink`: 318 | 319 | ```console 320 | 321 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "(university of alabama[AFFL]) \ 322 | > NOT (birmingham[AFFL] OR huntsville[AFFL])" \ 323 | > -datetype PDAT -mindate 2010 -maxdate 2020 | \ 324 | > elink -target pccompound -name pubmed_pccompound -cmd neighbor | \ 325 | > xtract -pattern LinkSet -element Id 326 | ... 327 | ... 328 | 21800250 329 | 21783326 18679079 8914 942 702 330 | 21782896 331 | 21756136 332 | 21728552 333 | 21718269 334 | 21711000 561577 169577 166929 166928 164636 335 | 21702462 336 | 21693669 337 | 21692575 338 | ... 339 | ... 340 | ``` 341 | 342 | _tested on 2021.01.27, EDirect 14.4, total count was 3639 (returns all PMIDs, not all have linked CIDs)._ 343 | 344 | The first column contains the PMIDs and the second column contains the linked PubChem CIDs (from the `pubmed_pccompound` links). As an aside, the PubMed query for "university of alabama" in the affiliation field (`[AFFL]`) excludes (NOT operator) any results containing huntsville or birmingham in the affiliation. This excludes references from University of Alabama at Birmingham and University of Alabama at Huntsville (including collaborative references with the Tuscaloosa campus). 345 | 346 | 347 | ### PubMed --> PubChem BioAssay 348 | **Description:** Search PubMed for an article, find related PubChem BioAssays, then retrieve some BioAssay data. 
349 | 350 | ```console 351 | 352 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "32459468"[PMID] | \ 353 | > elink -target pcassay -name pubmed_pcassay | \ 354 | > efetch -format docsum | \ 355 | > xtract -pattern DocumentSummary -element AID CurrentSourceName AssayName ActiveSidCount TargetCount 356 | 1347414 National Center for Advancing Translational Sciences (NCATS) qHTS to identify inhibitors of the type 1 interferon - major histocompatibility complex class I in skeletal muscle: Secondary screen by immunofluorescence 0 1 357 | 1347412 National Center for Advancing Translational Sciences (NCATS) qHTS assay to identify inhibitors of the type 1 interferon - major histocompatibility complex class I in skeletal muscle: Counter screen cell viability and HiBit confirmation 0 1 358 | 1347415 National Center for Advancing Translational Sciences (NCATS) qHTS to identify inhibitors of the type 1 interferon - major histocompatibility complex class I in skeletal muscle: tertiary screen by RT-qPCR 34 1 359 | 1347413 National Center for Advancing Translational Sciences (NCATS) qHTS to identify inhibitors of the type 1 interferon - major histocompatibility complex class I in skeletal muscle: tertiary screen by RT-qPCR, retest select compounds 3 1 360 | ... 361 | ``` 362 | _tested on 2021.01.27, EDirect 14.4, total count was 7._ 363 | 364 | ### PubMed <--> PubChem BioAssay 365 | **Description:** Search PubMed for an article, find the articles that cite it, then retrieve related PubChem BioAssays. 366 | 367 | If we want to retrieve the PMID <--> AID relationships (for Entrez link `pubmed_pcassay`), we can achieve this using the `-cmd neighbor` option in `elink`. Note that here we queried PubMed for an article, then found the articles that cite it with `elink -cited`, before piping these results into the Entrez link `pubmed_pcassay`.
368 | 369 | 370 | ```console 371 | 372 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "17876319"[PMID] | \ 373 | > elink -cited | \ 374 | > elink -target pcassay -name pubmed_pcassay -cmd neighbor | \ 375 | > xtract -pattern LinkSet -element Id 376 | ... 377 | ... 378 | 21167154 379 | 21164511 380 | 21159777 381 | 21138309 568760 568754 568753 568763 568762 568761 568759 568758 568757 568756 568755 382 | 21131971 383 | 21129186 384 | ... 385 | ... 386 | ``` 387 | _tested on 2021.01.27, EDirect 14.4, total count was 366 (returns all PMIDs, not all have linked AIDs)._ 388 | 389 | In the above table, the first column contains the PMIDs, subsequent columns contain the linked BioAssays (AIDs). 390 | 391 | ## EDirect PubChem BioAssay Entrez Links 392 | 393 | ### PubChem BioAssay --> PubMed 394 | **Description:** Search PubChem BioAssay for assays from a specific source name and then find related PubMed literature. 395 | 396 | In the below script, we first use `esearch` to query PubChem BioAssay for IUPHAR/BPS_Guide_to_PHARMACOLOGY in the Source Name field (`[SNME]`). This result is then piped into `elink` to retrieve linked records in the PubMed database (`pcassay_pubmed`). The `efilter` function was used to limit the results to publication dates (`PDAT`) from 2015 through 2020. This resulted in 332 records, and we used `efetch` to retrieve the PubMed records as XML, followed by `xtract` to extract some bibliographic information. 397 | 398 | ```console 399 | 400 | user@computer:~$ esearch -email name@xx.edu -db pcassay -query "IUPHAR/BPS_Guide_to_PHARMACOLOGY"[SNME] | \ 401 | > elink -target pubmed -name pcassay_pubmed | \ 402 | > efilter -mindate 2015 -maxdate 2020 -datetype PDAT | \ 403 | > efetch -format xml | \ 404 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \ 405 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn 406 | 29722898 Fu R Br. J. Pharmacol.
2018 175 14 3034-3049 407 | 29688582 Kato M Br J Clin Pharmacol 2018 84 8 1821-1829 408 | 29683659 Pike KG J. Med. Chem. 2018 61 9 3823-3841 409 | 29674331 Kawaharada S J. Pharmacol. Exp. Ther. 2018 366 1 58-65 410 | 29672049 Gucký T J. Med. Chem. 2018 61 9 3855-3869 411 | 29620892 Nikolaou A J. Med. Chem. 2018 61 8 3697-3711 412 | 29615471 Xu X J. Pharmacol. Exp. Ther. 2018 365 3 624-635 413 | 29608575 Taylor Meadows KR PLoS ONE 2018 13 4 e0193236 414 | ... 415 | ``` 416 | _tested on 2021.01.27, EDirect 14.4, total count was 332._ 417 | 418 | ### PubChem BioAssay --> PubChem Compound 419 | **Description:** Search PubChem BioAssay for an assay, find related PubChem Compounds, and retrieve some property data for the compounds. 420 | 421 | In the below script, we first use `esearch` to query PubChem BioAssay for the assay ID 527855 in the `[UID]` field. This result is then piped into `elink` to retrieve linked compounds in the PubChem Compound database (`pcassay_pccompound`). In this case, there were 16 compounds and we used `efetch` to retrieve the CID records as docsum XML, followed by `xtract` to extract the IsomericSmiles, CID, HydrogenBondDonorCount, HydrogenBondAcceptorCount, MolecularWeight, and XLogP values. 
422 | 423 | ```console 424 | 425 | user@computer:~$ esearch -email name@xx.edu -db pcassay -query "527855"[UID] | \ 426 | > elink -target pccompound -name pcassay_pccompound | \ 427 | > efetch -format docsum | \ 428 | > xtract -pattern DocumentSummary -element IsomericSmiles CID HydrogenBondDonorCount HydrogenBondAcceptorCount \ 429 | > MolecularWeight XLogP 430 | CN(CC1=CC=CC=C1)C(=O)C2=C(NC(=N2)C3=CC=CC=C3)C(=O)O 52949178 2 4 335.400 2.9 431 | C1=CC=C(C=C1)CCN(CC2=CC=CC=C2)C(=O)C3=C(NC(=N3)C4=CC=CC=C4)C(=O)O 52948352 2 4 425.500 4.8 432 | CN(CC(=O)O)C(=O)C1=C(NC(=N1)C2=CC=CC=C2)C(=O)O 52947957 3 6 303.270 1 433 | C1=CC=C(C=C1)CNC(=O)C2=C(NC(=N2)C3=CC=CC=C3)C(=O)O 52946755 3 4 321.300 2.7 434 | CCOC(=O)CN(CC1=CC=CC=C1)C(=O)C2=C(NC(=N2)C3=CC=CC=C3)C(=O)O 52945544 2 6 407.400 3.2 435 | C1=CC=C(C=C1)CN(CC2=CC=CC=C2)C(=O)C3=C(NC(=N3)C4=CC(=CC=C4)Cl)C(=O)O 52944295 2 4 445.900 5 436 | CCNC(=O)C1=C(NC(=N1)C2=CC=CC=C2)C(=O)O 52941818 3 4 259.260 1.6 437 | CNC(=O)C1=C(NC(=N1)C2=CC=CC=C2)C(=O)O 52941817 3 4 245.230 1.2 438 | ... 439 | ``` 440 | _tested on 2021.01.27, EDirect 14.4, total count was 16._ 441 | 442 | ### PubChem BioAssay <--> PubChem Compound 443 | **Description:** Search PubChem BioAssay for an assay, find related assays based on similar publications, then find related PubChem Compounds. 444 | 445 | If we want to retrieve the AID <--> CID relationships (for Entrez link `pcassay_pccompound`), we can achieve this using the `-cmd neighbor` option in `elink`. Here we queried PubChem BioAssay for an assay, then found related assays by similar publication list using `elink` (`pcassay_pcassay_similar_publication_list`). This result was then piped into the Entrez link `pcassay_pccompound`. 
446 | 447 | 448 | ```console 449 | 450 | user@computer:~$ esearch -email name@xx.edu -db pcassay -query "527855"[UID] | \ 451 | > elink -target pcassay -name pcassay_pcassay_similar_publication_list | \ 452 | > elink -target pccompound -name pcassay_pccompound -cmd neighbor | \ 453 | > xtract -pattern LinkSet -element Id 454 | ... 455 | ... 456 | 601409 54580326 457 | 601408 54580326 53257623 458 | 601154 54580326 459 | 657046 16093559 460 | 657045 70695880 70693764 70687505 70687504 70687503 70683264 70683263 70681155 70681154 70681153 60150625 461 | 527862 52948352 462 | 527861 52948352 463 | ... 464 | ... 465 | ... 466 | ``` 467 | 468 | In the above table, the first column contains the AIDs, subsequent columns contain the linked PubChem Compounds (CIDs). 469 | 470 | _tested on 2021.01.27, EDirect 14.4, total count was 81 (returns all AIDs, not all have linked CIDs)._ 471 | 472 | -------------------------------------------------------------------------------- /02_EDirect_Data_Fields_Structure.md: -------------------------------------------------------------------------------- 1 | # Available EDirect Databases, Data Fields, and Data Structures 2 | 3 | **Notes** 4 | 5 | > 1. `user@computer:~$` represents an example terminal prompt name. Actual command/argument input is after the `$`. 6 | > 2. Replace `name@xx.edu` with your email address. 7 | > 3. `\` followed by `>` on the next line represents continued terminal input. You will need to delete the `>` symbol in order to run the scripts as a copy/paste into terminal. 8 | > 4. You should validate your own EDirect scripts and results as there may be unintentional mistakes in these recipes. A convenient method is to compare your EDirect results to the NCBI Web interface search results: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/). 9 | 10 | We can view available Entrez databases, data fields, and links (connected records) with the EDirect `einfo` function. 
To retrieve a list of all databases, use the `-dbs` argument: 11 | 12 | ```console 13 | 14 | user@computer:~$ einfo -email name@xx.edu -dbs 15 | annotinfo 16 | assembly 17 | biocollections 18 | bioproject 19 | biosample 20 | biosystems 21 | blastdbinfo 22 | books 23 | cdd 24 | clinvar 25 | dbvar 26 | gap 27 | gapplus 28 | gds 29 | gene 30 | genome 31 | geoprofiles 32 | grasp 33 | gtr 34 | homologene 35 | ipg 36 | medgen 37 | mesh 38 | ncbisearch 39 | nlmcatalog 40 | nuccore 41 | nucleotide 42 | omim 43 | orgtrack 44 | pcassay 45 | pccompound 46 | pcsubstance 47 | pmc 48 | popset 49 | protein 50 | proteinclusters 51 | protfam 52 | pubmed 53 | seqannot 54 | snp 55 | sra 56 | structure 57 | taxonomy 58 | 59 | ``` 60 | 61 | ## PubChem Compound EDirect Fields, Links, and Data 62 | 63 | This EDirectChemInfo repository focuses on searching the PubChem Compound, PubMed, and PubChem BioAssay databases. So let's take a closer look at these three databases, starting with the PubChem Compound (`-db pccompound`) database. 
The `einfo` arguments `-fields` and `-links` provide information about the available data fields and linked information, respectively:

```console
user@computer:~$ einfo -email name@xx.edu -db pccompound -fields
AC ActiveAidCount
ACC AtomChiralCount
ACDC AtomChiralDefCount
ACUC AtomChiralUndefCount
ALL All Fields
BCC BondChiralCount
BCDC BondChiralDefCount
BCUC BondChiralUndefCount
CDAT CreateDate
CPLX Complexity
CSYN CompleteSynonym
CUC CovalentUnitCount
DCNT DepositorCount
DCSY DepositorCompleteSynonym
DSYN DepositorSynonym
ELMT Element
EMAS ExactMass
FILT Filter
HAC HeavyAtomCount
HBAC HydrogenBondAcceptorCount
HBDC HydrogenBondDonorCount
IAC IsotopeAtomCount
IKEY InChIKey
INCH InChI
MMAS MonoisotopicMass
MSHT MeSHTerm
MW MolecularWeight
PAID PharmActionID
PHMA PharmAction
RBC RotatableBondCount
SID SubstanceID
SRCC SourceCategory
SRC SourceName
STID StructureID
SYNO Synonym
TAC TotalAidCount
TFC TotalFormalCharge
TPSA TPSA
UID CompoundID
UPAC IUPACName
XLGP XLogP

user@computer:~$ einfo -email name@xx.edu -db pccompound -links
pccompound_biosystems BioSystems
pccompound_gene Gene
pccompound_mesh MeSH Keyword
pccompound_nuccore Nucleotide Sequences
pccompound_omim OMIM
pccompound_pcassay BioAssays
pccompound_pcassay_active BioAssays, Active
pccompound_pcassay_activityconcmicromolar BioAssays, activity concentration at/below 1 uM
pccompound_pcassay_activityconcnanomolar BioAssays, activity concentration at/below 1 nM
pccompound_pcassay_inactive BioAssays, Inactive
pccompound_pcassay_probe BioAssays, Probe
pccompound_pccompound Similar Compounds
pccompound_pccompound_3d Similar Conformers
pccompound_pccompound_mixture Mixture/Component Compounds
pccompound_pccompound_parent Parent Compound
pccompound_pccompound_parent_connectivity_pulldown Same Parent, Connectivity
pccompound_pccompound_parent_isotopes_pulldown Same Parent, Isotopes
pccompound_pccompound_parent_pulldown Same Parent
pccompound_pccompound_parent_stereo_pulldown Same Parent, Stereochemistry
pccompound_pccompound_parent_tautomer_pulldown Same Parent, Any Tautomer
pccompound_pccompound_sameanytautomer_pulldown Same, Any Tautomer
pccompound_pccompound_sameconnectivity_pulldown Same, Connectivity
pccompound_pccompound_sameisotopic_pulldown Same, Isotopes
pccompound_pccompound_samestereochem_pulldown Same, Stereochemistry
pccompound_pcsubstance PubChem Mixture Substances
pccompound_pcsubstance_same PubChem Same Substances
pccompound_pmc PMC Articles
pccompound_protein Protein Sequences
pccompound_pubmed PubMed Citations
pccompound_pubmed_mesh PubMed (MeSH Keyword)
pccompound_pubmed_publisher PubMed (Publisher)
pccompound_structure Protein Structures
pccompound_taxonomy Taxonomy
```

Now that we have an understanding of what kind of data is available in the PubChem Compound database, let's take a look at a PubChem Compound record using the `esearch` and `efetch` functions:

```console
user@computer:~$ esearch -email name@xx.edu -db pccompound -query "512323[UID]"
<ENTREZ_DIRECT>
  <Db>pccompound</Db>
  <WebEnv>MCID...</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>1</Count>
  <Step>1</Step>
  <Email>name@xx.edu</Email>
</ENTREZ_DIRECT>
```

We searched PubChem for the Compound Identifier 512323 using `esearch`, and the NCBI Entrez server returned a summary of the search results. The WebEnv and QueryKey specify the location of the search results on the NCBI server.
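
The search summary can also be inspected before fetching anything, for example to see how many records matched. The sketch below assumes the summary is the small `ENTREZ_DIRECT` XML message with `Db`, `WebEnv`, `QueryKey`, `Count`, `Step`, and `Email` elements; `summary_value` is a hypothetical helper, and the `summary` variable is a stand-in for a saved `esearch` result.

```shell
# Hypothetical helper (not part of EDirect): print the text of one XML element
# from an esearch summary read on stdin. Assumes the element sits on one line.
summary_value() {
  sed -n "s:.*<$1>\(.*\)</$1>.*:\1:p"
}

# Stand-in for a saved esearch result. The element names are an assumption
# about the ENTREZ_DIRECT message; the values come from the search above.
summary='<ENTREZ_DIRECT>
  <Db>pccompound</Db>
  <WebEnv>MCID...</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>1</Count>
  <Step>1</Step>
  <Email>name@xx.edu</Email>
</ENTREZ_DIRECT>'

# Number of matching records, then the database that was searched.
printf '%s\n' "$summary" | summary_value Count
printf '%s\n' "$summary" | summary_value Db
```

With EDirect installed, piping `esearch` into `xtract -pattern ENTREZ_DIRECT -element Count` should report the same count without a helper.
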
To retrieve the data, we can pipe (`|`) the `esearch` results directly into the `efetch` function:

```console
user@computer:~$ esearch -email name@xx.edu -db pccompound -query "512323[UID]" | \
> efetch -format docsum

Build210125-0720m.1
512323
512323
Chemical Vendors
Governmental Organizations
Subscription Services
Curation Efforts
Journal Publishers
Research and Development
Legacy Depositors
2005/08/01 00:00
CHEMBL1791149
89647-10-9
Uridine, 2'-deoxy-5-(2-thienyl)-
5-(2'-Thienyl)-2'-beta-deoxyuridine
SCHEMBL1635430
5-thien-2-yl-2'-deoxyuridine
CTK2J2646
5-(2-thienyl)-2'-deoxyuridine
5-(2'-Thienyl)-2'-deoxyuridine-
BDBM50407986
1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)oxolan-2-yl]-5-thiophen-2-ylpyrimidine-2,4-dione
1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)tetrahydrofuran-2-yl]-5-(2-thienyl)pyrimidine-2,4-dione
1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)oxolan-2-yl]-5-thiophen-2-ylpyrimidine-2,4-dione
C1C(C(OC1N2C=C(C(=O)NC2=O)C3=CC=CS3)CO)O
C1[C@@H]([C@H](O[C@H]1N2C=C(C(=O)NC2=O)C3=CC=CS3)CO)O
3
C13H14N2O5S
310.330
310330
0
-0.2
3
6
483.000
483000
21
3
3
0
0
0
0
0
1
127
1
37
PCDQBRGMSMVLDZ-IQJOONFLSA-N
0
InChI=1S/C13H14N2O5S/c16-6-9-8(17)4-11(20-9)15-5-7(10-2-1-3-21-10)12(18)14-13(15)19/h1-3,5,8-9,11,16-17H,4,6H2,(H,14,18,19)/t8-,9+,11+/m0/s1
```

`efetch` returned the CID record in document summary XML format (for other formats see `efetch -help`). XML is useful, but we probably want to parse the data into a table for easier viewing and analysis. EDirect contains a function called `xtract` that can convert the Entrez XML data into tables.
See `xtract -help` for more information. In brief, you select a main XML heading tag to define the extraction pattern and then specify the data you want to extract with the sub-heading tag names (elements). For example, in the above CID record data, we can set the pattern to DocumentSummary (the first main XML tag in this case) and the elements to a few of the sub-heading tags we are interested in, such as IsomericSmiles, CID, InChIKey, MolecularFormula, and MolecularWeight:

```console
user@computer:~$ esearch -email name@xx.edu -db pccompound -query "512323[UID]" | \
> efetch -format docsum | \
> xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey MolecularFormula MolecularWeight
C1[C@@H]([C@H](O[C@H]1N2C=C(C(=O)NC2=O)C3=CC=CS3)CO)O 512323 PCDQBRGMSMVLDZ-IQJOONFLSA-N C13H14N2O5S 310.330
```

## PubMed EDirect Fields, Links, and Data

As with PubChem Compound, let's preview the PubMed database's indexed fields, links, and data structure:

```console
user@computer:~$ einfo -email name@xx.edu -db pubmed -fields
AFFL Affiliation
ALL All Fields
AUCL Author Cluster ID
AUID Author - Identifier
AUTH Author
BOOK Book
CDAT Date - Completion
CNTY Place of Publication
COIS Conflict of Interest Statements
COLN Author - Corporate
CRDT Date - Create
DSO DSO
ECNO EC/RN Number
EDAT Date - Entrez
ED Editor
EID Extended PMID
EPDT Electronic Publication Date
FAUT Author - First
FILT Filter
FINV Investigator - Full
FULL Author - Full
GRNT Grant Number
INVR Investigator
ISBN ISBN
ISS Issue
JOUR Journal
LANG Language
LAUT Author - Last
LID Location ID
MAJR MeSH Major Topic
MDAT Date - Modification
MESH MeSH Terms
MHDA Date - MeSH
OTRM Other Term
PAGE Pagination
PAPX Pharmacological Action
PDAT Date - Publication
PID Publisher ID
PPDT Print Publication Date
PS Subject - Personal Name
PTYP Publication Type
PUBN Publisher
SI Secondary Source ID
SUBH MeSH Subheading
SUBS Supplementary Concept
TIAB Title/Abstract
TITL Title
TT Transliterated Title
UID UID
VOL Volume
WORD Text Word

user@computer:~$ einfo -email name@xx.edu -db pubmed -links
pubmed_assembly Assembly
pubmed_bioproject Project Links
pubmed_biosample BioSample Links
pubmed_biosystems BioSystem Links
pubmed_books_refs Cited in Books
pubmed_cdd Conserved Domain Links
pubmed_clinvar_calculated ClinVar (calculated)
pubmed_clinvar ClinVar
pubmed_dbvar dbVar
pubmed_gap dbGaP Links
pubmed_gds GEO DataSet Links
pubmed_gene_bookrecords Gene (from Bookshelf)
pubmed_gene_citedinomim Gene (OMIM) Links
pubmed_gene Gene Links
pubmed_gene_pmc_nucleotide Gene (nucleotide/PMC)
pubmed_gene_rif Gene (GeneRIF) Links
pubmed_genome Genome Links
pubmed_geoprofiles GEO Profile Links
pubmed_homologene HomoloGene Links
pubmed_medgen_bookshelf_cited MedGen (Bookshelf cited)
pubmed_medgen_genereviews MedGen (GeneReviews)
pubmed_medgen MedGen
pubmed_medgen_omim MedGen (OMIM)
pubmed_nuccore Nucleotide Links
pubmed_nuccore_refseq Nucleotide (RefSeq) Links
pubmed_nuccore_weighted Nucleotide (Weighted) Links
pubmed_omim_bookrecords OMIM (from Bookshelf)
pubmed_omim_calculated OMIM (calculated) Links
pubmed_omim_cited OMIM (cited) Links
pubmed_pcassay PubChem BioAssay
pubmed_pccompound_mesh PubChem Compound (MeSH Keyword)
pubmed_pccompound PubChem Compound
pubmed_pccompound_publisher PubChem Compound (Publisher)
pubmed_pcsubstance_bookrecords PubChem Substance (from Bookshelf)
pubmed_pcsubstance PubChem Substance Links
pubmed_pcsubstance_publisher PubChem Substance (Publisher)
pubmed_pmc_bookrecords References in PMC for this Bookshelf citation
pubmed_pmc_embargo
pubmed_pmc_local
pubmed_pmc PMC Links
pubmed_pmc_refs Cited in PMC
pubmed_popset PopSet Links
pubmed_probe Probe Links
pubmed_proteinclusters Protein Cluster Links
pubmed_protein Protein Links
pubmed_protein_refseq Protein (RefSeq) Links
pubmed_protein_weighted Protein (Weighted) Links
pubmed_protfam Protein Family Models
pubmed_pubmed_alsoviewed Articles frequently viewed together
pubmed_pubmed_bookrecords References for this Bookshelf citation
pubmed_pubmed_refs References for PMC Articles
pubmed_pubmed Similar articles
pubmed_snp_cited SNP (Cited)
pubmed_snp SNP Links
pubmed_sra SRA Links
pubmed_structure Structure Links
pubmed_taxonomy_entrez Taxonomy via GenBank
```

And here is an example PubMed article record in abstract form:

```console
user@computer:~$ esearch -email name@xx.edu -db pubmed -query "17630804[PMID]" | \
> efetch -format abstract

1. J Org Chem. 2007 Aug 17;72(17):6621-3. Epub 2007 Jul 14.

Total synthesis and absolute configuration determination of (+)-bruguierol C.

Solorio DM(1), Jennings MP.

Author information:
(1)Department of Chemistry, 500 Campus Drive, The University of Alabama,
Tuscaloosa, Alabama 35487-0336, USA.

The first total synthesis and absolute configuration of bruguierol C are
reported. The key step involved the diastereoselective capture of an in situ
generated oxocarbenium ion via an intramolecular Friedel-Crafts alkylation.

DOI: 10.1021/jo071035l
PMID: 17630804 [Indexed for MEDLINE]
```

And here is the same record in XML format:

```console
user@computer:~$ esearch -email name@xx.edu -db pubmed -query "17630804[PMID]" | \
> efetch -format xml

17630804
2007
10
25
2007
08
10
0022-3263
72
17
2007
Aug
17
The Journal of organic chemistry
J Org Chem
Total synthesis and absolute configuration determination of (+)-bruguierol C.
6621-3
The first total synthesis and absolute configuration of bruguierol C are reported. The key step involved the diastereoselective capture of an in situ generated oxocarbenium ion via an intramolecular Friedel-Crafts alkylation.
Solorio
Dionicio Martinez
DM
Department of Chemistry, 500 Campus Drive, The University of Alabama, Tuscaloosa, Alabama 35487-0336, USA.
Jennings
Michael P
MP
eng
Journal Article
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.
2007
07
14
United States
J Org Chem
2985193R
0022-3263
0
Heterocyclic Compounds, 3-Ring
0
bruguierol C
IM
Heterocyclic Compounds, 3-Ring
chemical synthesis
chemistry
Magnetic Resonance Spectroscopy
Molecular Structure
Spectrometry, Mass, Electrospray Ionization
Spectrophotometry, Infrared
Stereoisomerism
2007
7
17
9
0
2007
10
27
9
0
2007
7
17
9
0
ppublish
17630804
10.1021/jo071035l
```

The XML PubMed record above is hard to read because it has many fields. We can use the `xtract -outline` argument to present a structured view of only the XML data tags:

```console
user@computer:~$ esearch -email name@xx.edu -db pubmed -query "17630804[PMID]" | \
> efetch -format xml | \
> xtract -outline
PubmedArticle
  MedlineCitation
    PMID
    DateCompleted
      Year
      Month
      Day
    DateRevised
      Year
      Month
      Day
    Article
      Journal
        ISSN
        JournalIssue
          Volume
          Issue
          PubDate
            Year
            Month
            Day
        Title
        ISOAbbreviation
      ArticleTitle
      Pagination
        MedlinePgn
      Abstract
        AbstractText
      AuthorList
        Author
          LastName
          ForeName
          Initials
          AffiliationInfo
            Affiliation
        Author
          LastName
          ForeName
          Initials
      Language
      PublicationTypeList
        PublicationType
        PublicationType
        PublicationType
      ArticleDate
        Year
        Month
        Day
    MedlineJournalInfo
      Country
      MedlineTA
      NlmUniqueID
      ISSNLinking
    ChemicalList
      Chemical
        RegistryNumber
        NameOfSubstance
      Chemical
        RegistryNumber
        NameOfSubstance
    CitationSubset
    MeshHeadingList
      MeshHeading
        DescriptorName
        QualifierName
        QualifierName
      MeshHeading
        DescriptorName
      MeshHeading
        DescriptorName
      MeshHeading
        DescriptorName
      MeshHeading
        DescriptorName
      MeshHeading
        DescriptorName
  PubmedData
    History
      PubMedPubDate
        Year
        Month
        Day
        Hour
        Minute
      PubMedPubDate
        Year
        Month
        Day
        Hour
        Minute
      PubMedPubDate
        Year
        Month
        Day
        Hour
        Minute
    PublicationStatus
    ArticleIdList
      ArticleId
      ArticleId
```

This structured output makes it easier to see how the XML is organized and to decide which data elements to extract with `xtract`, such as the PMID, Author/LastName, Author/Initials, ISOAbbreviation, ArticleTitle, PubDate, Volume, Issue, and MedlinePgn:

```console
user@computer:~$ esearch -email name@xx.edu -db pubmed -query "17630804[PMID]" | \
> efetch -format xml | \
> xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName Author/Initials \
> ISOAbbreviation ArticleTitle PubDate/Year Volume Issue MedlinePgn
17630804 Solorio DM J. Org. Chem. Total synthesis and absolute configuration determination of (+)-bruguierol C. 2007 72 17 6621-3
```

Note that in the `xtract` command above, the elements after the PMID are selected with `-first`, so that only the first occurrence of each is extracted (e.g., the first Author).

## PubChem BioAssay EDirect Fields, Links, and Data

Lastly, let's take a look at the PubChem BioAssay database indexed fields, links, and data structure:

```console
user@computer:~$ einfo -email name@xx.edu -db pcassay -fields
AC Active Sid Count
ACMD Activity Outcome Method
ACMT Assay Comment
ADES Assay Description
ALL All Fields
ANAM Assay Name
APRJ Assay Project
APRL Assay Protocol
ASRD Assay Source ID
BSID BioSystems ID
CCMT Categorized Comment
CCT Categorized Comment Title
CELL Cell Line
CSNM Current Source Name
DDAT Deposit Date
DTMD Detection Method
FILT Filter
GBAC GenBank Accession
GRN Grant Number
GSYM Gene Symbol
HDAT Hold Until Date
JNAM Journal Name
MDAT Modify Date
NARD Nucleic Acid Reagent ID
NSAM Number of Sids With Activity Concentration micromolar
NSAN Number of Sids With Activity Concentration nanomolar
ORGN Organism
PCC Probe Cid Count
PIGI Pig GI
PSC Probe Sid Count
PTGI Protein Target GI
PTN Protein Target Name
RTGI RNA Target GI
SNME Source Name
SRCC Source Category
SYNT Synonym Tested
TCNT Target Count
TSC Total Sid Count
TXNM Taxonomy Name
UID Assay ID
UPAC UniProt Accession

user@computer:~$ einfo -email name@xx.edu -db pcassay -links
pcassay_books_probe MLP Chemical Probe Report
pcassay_cdd_protein_target Conserved Domains (Full) via Protein Target
pcassay_gene_rnai RNAi Target, Tested
pcassay_gene_rnai_active RNAi Target, Active
pcassay_gene_target Gene Target
pcassay_nuccore Nucleotide
pcassay_nuccore_rna_target Nucleotide RNA Target
pcassay_omim OMIM
pcassay_pcassay_activityneighbor_list Related BioAssays, by Activity Overlap (List)
pcassay_pcassay_assay_project Related Assay Projects
pcassay_pcassay_common_gene_list Related BioAssays, by Common Active Gene (List)
pcassay_pcassay_gene_interaction_list Related BioAssays, by Gene Interaction (List)
pcassay_pcassay_neighbor_list Related BioAssays, by Depositor (List)
pcassay_pcassay_same_assay_project_list Related BioAssays, by Same Project (List)
pcassay_pcassay_same_publication_list Related BioAssays, by Same Publication (List)
pcassay_pcassay_similar_publication_list Related BioAssays, by Similar Publication (List)
pcassay_pcassay_targetneighbor_list Related BioAssays, by Target Similarity (List)
pcassay_pccompound Compounds
pcassay_pccompound_active Compounds, Active
pcassay_pccompound_activityconcmicromolar Compounds, activity concentration at/below 1 uM
pcassay_pccompound_activityconcnanomolar Compounds, activity concentration at/below 1 nM
pcassay_pccompound_inactive Compounds, Inactive
pcassay_pccompound_probe Compounds, Probe
pcassay_pcsubstance Substances
pcassay_pcsubstance_active Substances, Active
pcassay_pcsubstance_activityconcmicromolar Substances, activity concentration at/below 1 uM
pcassay_pcsubstance_activityconcnanomolar Substances, activity concentration at/below 1 nM
pcassay_pcsubstance_inactive Substances, Inactive
pcassay_pcsubstance_probe Substances, Probe
pcassay_pmc PMC Articles
pcassay_probe Nucleic acid reagent
pcassay_protein_target Protein Target
pcassay_protein_target_pig Protein Target, Identical Sequence
pcassay_pubmed PubMed Citations
pcassay_sparcle_target Target Functional Class
pcassay_structure Protein Structures
pcassay_taxonomy Taxonomy
```

And here is an example PubChem BioAssay record in docsum XML:

```console
user@computer:~$ esearch -email name@xx.edu -db pcassay -query "1236573[UID]" | \
> efetch -format docsum

Build200617-0952.1
1236573
Cytotoxicity against human NCI/ADR cells
Title: Progress Toward the Development of Noscapine and Derivatives as Anticancer Agents. Abstract: Many nitrogen-moiety containing alkaloids derived from plant origins are bioactive and play a significant role in human health and emerging medicine. Noscapine, a phthalideisoquinoline alkaloid derived from Papaver somniferum, has been used as a cough suppressant since the mid 1950s, illustrating a good safety profile. Noscapine has since been discovered to arrest cells at mitosis, albeit with moderately weak activity. Immunofluorescence staining of microtubules after 24 h of noscapine exposure at 20 inverted question markM elucidated chromosomal abnormalities and the inability of chromosomes to complete congression to the equatorial plane for proper mitotic separation ( Proc. Natl. Acad. Sci. U. S. A. 1998 , 95 , 1601 - 1606 ). A number of noscapine analogues possessing various modifications have been described within the literature and have shown significantly improved antiprolific profiles for a large variety of cancer cell lines. Several semisynthetic antimitotic alkaloids are emerging as possible candidates as novel anticancer therapies. This perspective discusses the advancing understanding of noscapine and related analogues in the fight against malignant disease.
1506980
ChEMBL
ChEMBL
1
Confirmatory
1
No
2018/10/08 00:00
2016/12/22 00:00
1/01/01 00:00
1236573
0
0
1
1
```

As with the PubChem Compound and PubMed data, we can extract specific data using the `xtract` function, such as the AID, CurrentSourceName, AssayName, ActiveSidCount, and TargetCount:

```console
user@computer:~$ esearch -email name@xx.edu -db pcassay -query "1236573[UID]" | \
> efetch -format docsum | \
> xtract -pattern DocumentSummary -element AID CurrentSourceName AssayName ActiveSidCount TargetCount
1236573 ChEMBL Cytotoxicity against human NCI/ADR cells 1 0
```
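
Because `xtract` prints one row per pattern match, the result is easier to reuse with column names attached. The sketch below is one way to do that under two assumptions: `with_header` is a hypothetical helper (not an EDirect command), and the columns are tab-separated, which is the `xtract` default. The `printf` line stands in for the live pipeline and reuses the BioAssay row shown above.

```shell
# Hypothetical helper (not part of EDirect): print the arguments as one
# tab-separated header line, then pass the rows from stdin through unchanged.
with_header() {
  # "$*" joins the arguments with the first character of IFS (a tab here);
  # the subshell keeps the IFS change from leaking into the caller.
  ( IFS="$(printf '\t')"; printf '%s\n' "$*" )
  cat
}

# Stand-in for:
#   esearch ... | efetch -format docsum | \
#     xtract -pattern DocumentSummary -element AID CurrentSourceName AssayName ActiveSidCount TargetCount
printf '1236573\tChEMBL\tCytotoxicity against human NCI/ADR cells\t1\t0\n' |
  with_header AID CurrentSourceName AssayName ActiveSidCount TargetCount
```

Redirecting the result to a file (for example `> assays.tsv`, a name chosen only for illustration) gives a table that spreadsheet software and most data analysis tools can open directly.
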