├── images
    ├── mview01.png
    ├── mview02.png
    ├── obabel01.png
    └── obabel02.png
├── Workshops
    └── EDirect Workshop_Scalfani_sp2021.pdf
├── ACS_sp2021_talk
    └── Scalfani_VF_EDirect_ACS_sp2021.pdf
├── LICENSE
├── README.md
├── 01_EDirect_Intro.md
├── 06_EDirect_Combining_Tools.md
├── 04_EDirect_PubChem_Recipes.md
├── 05_EDirect_PubMed_Recipes.md
├── 03_EDirect_PubChem_BioAssay_PubMed_Recipes.md
└── 02_EDirect_Data_Fields_Structure.md


/images/mview01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/images/mview01.png


--------------------------------------------------------------------------------
/images/mview02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/images/mview02.png


--------------------------------------------------------------------------------
/images/obabel01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/images/obabel01.png


--------------------------------------------------------------------------------
/images/obabel02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/images/obabel02.png


--------------------------------------------------------------------------------
/Workshops/EDirect Workshop_Scalfani_sp2021.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/Workshops/EDirect Workshop_Scalfani_sp2021.pdf


--------------------------------------------------------------------------------
/ACS_sp2021_talk/Scalfani_VF_EDirect_ACS_sp2021.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/ACS_sp2021_talk/Scalfani_VF_EDirect_ACS_sp2021.pdf


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2020 Vincent F. Scalfani
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # EDirectChemInfo
 2 | 
 3 | **Notes**
 4 | 
 5 | > Oct 21, 2024 - This repository has recently been transferred from The University of Alabama Libraries Web Services GitHub to The University of Alabama Libraries Research Data Services GitHub organization.
 6 | > All GitHub related hyperlinks should automatically redirect to the new GitHub location, but if you notice anything that is not working correctly, please let us know.
 7 | 
 8 | This repository contains Entrez Direct (EDirect, an NCBI tool) Unix scripts for programmatically obtaining data from various NCBI databases. Other EDirect resources and guides exist (referenced below). This EDirectChemInfo repository differs in that the focus is on teaching how to obtain chemical information, cheminformatics data, and chemical structure <--> bioassay <--> document relationship links. There are not many PubChem EDirect examples available, so hopefully this repository proves useful. I have also added some tips, step-wise directions, and code output examples to help you get started.
 9 | 
10 | Please note that this EDirectChemInfo repository is not affiliated with NCBI. You should contact [NCBI](https://www.ncbi.nlm.nih.gov/books/NBK179288/#_chapter6_For_More_Information_) for specific questions related to EDirect. This repository was created to accompany library instruction at The University of Alabama. With that in mind, please feel free to open a GitHub Issue or contact me directly with comments/questions if you think there is something I can help you with. In addition, if this repository has been a useful resource for you, please do let me know as this type of feedback can help prioritize my time.
11 | 
12 | Vincent Scalfani\
13 | Science and Engineering Librarian\
14 | The University of Alabama\
15 | [UA Libraries Directory](https://www.lib.ua.edu/#/staffdir?liaison=1&search=scalfani)
16 | 
17 | ## Contents
18 | 
19 | * [What is EDirect?](https://github.com/vfscalfani/EDirectChemInfo/blob/master/01_EDirect_Intro.md)
20 |   * [Installation Tips](https://github.com/vfscalfani/EDirectChemInfo/blob/master/01_EDirect_Intro.md#installation-tips)
21 |   * [Usage Tips](https://github.com/vfscalfani/EDirectChemInfo/blob/master/01_EDirect_Intro.md#usage-tips)
22 |   * [EDirect Function Help and Debug](https://github.com/vfscalfani/EDirectChemInfo/blob/master/01_EDirect_Intro.md#e-utility-application-help)
23 | * [Available Databases, Data Fields, and Data Structures](https://github.com/vfscalfani/EDirectChemInfo/blob/master/02_EDirect_Data_Fields_Structure.md)
24 | * [PubChem <--> PubChem BioAssay <--> PubMed EDirect Recipes](https://github.com/vfscalfani/EDirectChemInfo/blob/master/03_EDirect_PubChem_BioAssay_PubMed_Recipes.md)
25 | * [PubChem EDirect Recipes](https://github.com/vfscalfani/EDirectChemInfo/blob/master/04_EDirect_PubChem_Recipes.md)
26 | * [PubMed EDirect Recipes](https://github.com/vfscalfani/EDirectChemInfo/blob/master/05_EDirect_PubMed_Recipes.md)
27 | * [Combining EDirect Results with Chemical Depiction and Plotting](https://github.com/vfscalfani/EDirectChemInfo/blob/master/06_EDirect_Combining_Tools.md)
28 | 
29 | ## References
30 | 
31 | These are the main references I used to learn about NCBI E-Utilities, the EDirect syntax, Unix commands/scripts, and the importance of linked chemical data. Many thanks to the authors for their work.
32 | 
33 | 1. [NCBI Documentation for Entrez Direct: E-utilities on the UNIX Command Line](https://www.ncbi.nlm.nih.gov/books/NBK179288/)
34 | 2. [NIH NLM The Insider's Guide to Accessing NLM Data](https://dataguide.nlm.nih.gov/)
35 | 3. [NCBI EDirect Cookbook](https://github.com/NCBI-Hackathons/EDirectCookbook)
36 | 4. [Computational Genomics Manual: NCBI EDirect](https://github.com/linsalrob/ComputationalGenomicsManual/blob/master/Databases/NCBI_Edirect.md)
37 | 5. [Entrez Link Descriptions](https://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html)
38 | 6. [Software Carpentry: The Unix Shell](https://swcarpentry.github.io/shell-novice/)
39 | 7. [Opening up connectivity between documents, structures and bioactivity by Christopher Southan](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7136548/)
40 | 
41 | 
42 | ## License Notes
43 | 
44 | Code in this repository is licensed under the [MIT License](https://github.com/vfscalfani/EDirectChemInfo/blob/master/LICENSE). Some of the chemical depiction demonstrations from EDirect output use proprietary software, such as ChemAxon Marvin, which is not included under this license. Users must have valid licenses for any required proprietary software to run these portions of the code.
45 | 
46 | Code output (e.g., reference/molecular data snippets) retrieved from NCBI via their EDirect utility is shown for code demonstration purposes only and is credited to NCBI and NLM. Please see the [NCBI Website and Data Usage Policies and Disclaimers](https://www.ncbi.nlm.nih.gov/home/about/policies/) for more information regarding the data.
47 | 
48 | 


--------------------------------------------------------------------------------
/01_EDirect_Intro.md:
--------------------------------------------------------------------------------
  1 | # What is EDirect?
  2 | 
  3 | EDirect is a Unix command line tool from NCBI that allows programmatic retrieval of chemical/biological data and literature references from NCBI databases. EDirect reduces the barrier to accessing NCBI data programmatically; that is, with a basic knowledge of the Unix shell (e.g., bash), it is straightforward to obtain and format your own custom datasets, often with only a few lines of code. Moreover, you can input data retrieved from EDirect into other Unix tools for quick viewing and analysis (see [Pipeline (Unix)](https://en.wikipedia.org/wiki/Pipeline_(Unix))).
  4 | 
  5 | ## Installation Tips
  6 | 
  7 | Follow the installation instructions from NCBI: [Entrez Direct: E-utilities on the UNIX Command Line](https://www.ncbi.nlm.nih.gov/books/NBK179288/). There are several different methods to install EDirect. I used option 3 (EDirect v14.4) with `wget` in Gnome Terminal on a Linux Ubuntu 18.04 workstation. If you are using Windows, NCBI mentions that you can use the Cygwin Unix emulator. Another option for Windows users is to set up a Linux virtual machine. There are many tutorials for setting up virtual machines. For example, here is one for installing [Ubuntu on VirtualBox](https://askubuntu.com/questions/142549/how-to-install-ubuntu-on-virtualbox). When installing EDirect in a virtual machine, you may need to customize the VirtualBox network settings in order to use the `curl` or `wget` EDirect installation methods. In my testing on an Ubuntu 20.04 virtual machine, the fourth installation option for EDirect (using the longer perl script) worked fine with the standard VirtualBox network settings.
  8 | 
  9 | ## Usage Tips
 10 | 
 11 | NCBI has specific data usage policies and disclaimers:
 12 | 
 13 | * [NCBI Website and Data Usage Policies and Disclaimers](https://www.ncbi.nlm.nih.gov/home/about/policies/)
 14 | * [Entrez Programming Utilities Help](https://www.ncbi.nlm.nih.gov/books/NBK25501/)
 15 | 
 16 | If you do not follow NCBI's usage policies (e.g., no more than 3 requests per second), NCBI may block your IP address. So be cautious and follow good programming practices of testing and adding sleep delays, particularly if executing multiple sequential calls in a loop. Moreover, it is always a good idea to include your email address in the requests so that NCBI can contact you if necessary. You can add your email address within each query like this:
 17 | 
 18 | ```console
 19 | 
 20 | user@computer:~$ e-function -email name@xx.edu -arg input
 21 | 
 22 | ```
 23 | Replace `name@xx.edu` with your email address. The `e-function` is a placeholder for one of the actual EDirect functions like `einfo` or `esearch`, and `-arg input` is a placeholder for e-function argument(s) like `-db pccompound` or `-db pubmed -query "food allergies"`.
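
The advice above about sleep delays can be sketched as a small shell helper. This is only a minimal sketch under stated assumptions: `run_throttled` is a hypothetical name (it is not part of EDirect), and `echo` stands in for whatever command or function you would actually run per identifier. Standard input is redirected from `/dev/null` as a precaution, so that a command which reads standard input cannot swallow the rest of the identifier list.

```shell
# Hypothetical helper (not part of EDirect): read one identifier per line
# from standard input and run the given command on each one, pausing
# between calls to stay under the request limit.
run_throttled() {
  rt_delay="$1"
  shift
  while IFS= read -r rt_id; do
    # Skip blank lines in the input list.
    [ -n "$rt_id" ] || continue
    # Redirect stdin so the command cannot consume the remaining identifiers.
    "$@" "$rt_id" < /dev/null
    sleep "$rt_delay"
  done
}

# Example with a stand-in command: swap `echo` for your own function that
# wraps an EDirect call and passes -email.
printf '%s\n' 146021325 11068043 | run_throttled 1 echo
```

A one-second pause is a conservative choice relative to the limit of 3 requests per second mentioned above; adjust it to suit your own workload.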
 24 | 
 25 | ## EDirect Function Help
 26 | 
 27 | I generally refer to the official [Entrez Programming Utilities Help Document](https://www.ncbi.nlm.nih.gov/books/NBK25501/) or the [NIH NLM E-Utilities Documentation](https://dataguide.nlm.nih.gov/eutilities/utilities.html); however, for a quick reference or reminder of the proper syntax, the `-help` option is useful. Here is an example with the `einfo` function:
 28 | 
 29 | ```console
 30 | 
 31 | user@computer:~$ einfo -help
 32 | einfo 14.4
 33 | 
 34 | Database Selection
 35 | 
 36 |   -dbs       Print all database names
 37 |   -db        Database name (or "all")
 38 | 
 39 | Data Summaries
 40 | 
 41 |   -fields    Print field names
 42 |   -links     Print link names
 43 | 
 44 | Field Example
 45 | 
 46 |   <Field>
 47 |     <Name>ALL</Name>
 48 |     <FullName>All Fields</FullName>
 49 |     <Description>All terms from all searchable fields</Description>
 50 |     <TermCount>245340803</TermCount>
 51 |     <IsDate>N</IsDate>
 52 |     <IsNumerical>N</IsNumerical>
 53 |     <SingleToken>N</SingleToken>
 54 |     <Hierarchy>N</Hierarchy>
 55 |     <IsHidden>N</IsHidden>
 56 |     <IsTruncatable>Y</IsTruncatable>
 57 |     <IsRangable>N</IsRangable>
 58 |   </Field>
 59 | 
 60 | Link Example
 61 | 
 62 |   <Link>
 63 |     <Name>pubmed_protein</Name>
 64 |     <Menu>Protein Links</Menu>
 65 |     <Description>Published protein sequences</Description>
 66 |     <DbTo>protein</DbTo>
 67 |   </Link>
 68 |   <Link>
 69 |     <Name>pubmed_protein_refseq</Name>
 70 |     <Menu>Protein (RefSeq) Links</Menu>
 71 |     <Description>Link to Protein RefSeqs</Description>
 72 |     <DbTo>protein</DbTo>
 73 |   </Link>
 74 | 
 75 | ```
 76 | 
 77 | ## EDirect Query Translation via Debug Flag
 78 | 
 79 | When experimenting with searches in EDirect, it is often helpful to view the interpreted query. This can be accomplished using the `-debug` flag in EDirect 14.4 (thanks to NLM Support for the explanation and tip!):
 80 | 
 81 | ```console
 82 | 
 83 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "hydrogel-based drug delivery" -debug
 84 | nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ esearch.fcgi -retmax 0 -usehistory y -db pubmed -term "hydrogel-based drug delivery" -tool edirect -edirect 14.4 -edirect_os Linux -email name@xx.edu
 85 | <ENTREZ_DIRECT>
 86 |   <Db>pubmed</Db>
 87 |   <WebEnv>MCID...</WebEnv>
 88 |   <QueryKey>1</QueryKey>
 89 |   <Count>436</Count>
 90 |   <Step>1</Step>
 91 |   <Email>name@xx.edu</Email>
 92 |   <Debug>Y</Debug>
 93 | </ENTREZ_DIRECT>
 94 | 
 95 | ```
 96 | Next, copy and run the `nquire` command and pipe the results to `xtract`, extracting out the QueryTranslation element:
 97 | 
 98 | ```console
 99 | 
100 | user@computer:~$ nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ esearch.fcgi -retmax 0 -usehistory y -db pubmed -term "hydrogel-based drug delivery" | xtract -pattern eSearchResult -element QueryTranslation
101 | hydrogel-based[All Fields] AND ("drug delivery systems"[MeSH Terms] OR ("drug"[All Fields] AND "delivery"[All Fields] AND "systems"[All Fields]) OR "drug delivery systems"[All Fields] OR ("drug"[All Fields] AND "delivery"[All Fields]) OR "drug delivery"[All Fields])
102 | 
103 | ```
104 | 
105 | 


--------------------------------------------------------------------------------
/06_EDirect_Combining_Tools.md:
--------------------------------------------------------------------------------
  1 | # Combining EDirect Results with Chemical Depiction and Plotting
  2 | 
  3 | **Notes**
  4 | 
  5 | > 1. `user@computer:~$` represents an example terminal prompt name. Actual command/argument input is after the `$`.
  6 | > 2. Replace `name@xx.edu` with your email address.
  7 | > 3. `\` followed by `>` on the next line represents continued terminal input. You will need to delete the `>` symbol in order to run the scripts as a copy/paste into terminal.
  8 | > 4. You should validate your own EDirect scripts and results as there may be unintentional mistakes in these recipes. A convenient method is to compare your EDirect results to the NCBI Web interface search results: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/).
  9 | 
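Note 3 above can be automated when copying a recipe. The following is a minimal sketch, assuming the prompt is exactly `user@computer:~$` as shown in these recipes; `strip_prompts` is a hypothetical helper name. It only removes the prompt and the leading `>` continuation markers, so copy the command lines only, not the output that follows them.

```shell
# Hypothetical helper: strip the example prompt and the leading ">"
# continuation markers from a copied recipe so it can be pasted and run.
strip_prompts() {
  sed -e 's/^user@computer:~\$ *//' -e 's/^> *//'
}

# Example: clean up a copied three-line recipe.
strip_prompts <<'EOF'
user@computer:~$ esearch -email name@xx.edu -db pccompound -query "13586"[UID] | \
> efetch -format docsum | \
> xtract -pattern DocumentSummary -element IsomericSmiles
EOF
```

The trailing `\` characters are left in place because they are what lets the shell treat the lines as one command.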
 10 | ## EDirect --> Chemical Depiction and Plots
 11 | 
 12 | It is possible to pipe EDirect results into chemical structure viewers as some cheminformatics toolkits can read chemical file formats (e.g., SMILES) directly from standard input.
 13 | 
 14 | ### ChemAxon MarvinView Chemical Depiction
 15 | 
 16 | For [ChemAxon Marvin](https://chemaxon.com/products/marvin), we can pipe EDirect compiled SMILES directly into Marvin View (`mview`). Note that the `-` is the `mview` option to read structures from standard input.
 17 | 
 18 | ```console
 19 | 
 20 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "132427739"[UID] | \
 21 | > efetch -format docsum | \
 22 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey IUPACName | \
 23 | > mview -
 24 | 
 25 | ```
 26 | 
 27 | ![mview01](/images/mview01.png)
 28 | 
 29 | If you have multiple molecules to display, you can use the `mview` standard input option `-` along with the `gridbag` option to display the molecules in a matrix:
 30 | 
 31 | 
 32 | ```console
 33 | 
 34 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "132427739"[UID] | \
 35 | > elink -target pccompound -name pccompound_pccompound | \
 36 | > efetch -format docsum | \
 37 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey IUPACName | \
 38 | > mview --gridbag -
 39 | 
 40 | ```
 41 | 
 42 | ![mview02](/images/mview02.png)
 43 | 
 44 | _tested with ChemAxon MarvinView version 19.27.0._
 45 | 
 46 | ### Open Babel Chemical Depiction
 47 | 
 48 | One really cool feature of using [Open Babel](https://github.com/openbabel) is the ability to display molecules as ASCII figures directly in the terminal. Below, we pipe the results to Open Babel, reading SMILES from standard input with `-ismi`, and then write the output in ASCII format with `-oascii`. The `-xh 10` is a resizing option.
 49 | 
 50 | ```console
 51 | 
 52 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "13586"[UID] | \
 53 | > efetch -format docsum | \
 54 | > xtract -pattern DocumentSummary -element IsomericSmiles | \
 55 | > openbabel.obabel -ismi -oascii -xh 10
 56 |                    __                                                          
 57 |                   __ \__                                                       
 58 |                 _/  \__ \_                                                     
 59 |              __/       \_ \_                                                   
 60 |  O  ________/            \_                                                    
 61 |          /                                                                     
 62 |         /                                                                      
 63 |    \    |                                                                      
 64 |     \  /                                                                       
 65 |      \/                                                                        
 66 | 1 molecule converted
 67 | ```
 68 | 
 69 | This works for multiple molecules too!
 70 | 
 71 | ```console
 72 | 
 73 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "132427739"[UID] | \
 74 | > elink -target pccompound -name pccompound_pccompound | \
 75 | > efetch -format docsum | \
 76 | > xtract -pattern DocumentSummary -element IsomericSmiles | \
 77 | > openbabel.obabel -ismi -oascii -xh 10
 78 |                \                                                               
 79 |               ____                                                             
 80 |                 |                                                              
 81 |          __  O_Si__                                                            
 82 |       O\  \__\  |                                                              
 83 |        |\N /  \                                                                
 84 |     __O|  \__/                                                                 
 85 |  /__/      \ OH                                                                
 86 | |  \                                                                           
 87 | /|  /                                                                          
 88 |              /                                                                 
 89 |     /       |                                                                  
 90 |  \_ |    O ____                                                                
 91 |    \__O___  |____                                                              
 92 | __Si_O   N \\/   \                                                             
 93 |      \      /|  /                                                              
 94 |    |  \___   __N                                                               
 95 |                 \__O    ____                                                   
 96 |                O /  ___\/  \\                                                  
 97 |                 /      /|  /                                                   
 98 |                        __   __                                                 
 99 |                          \_/                                                   
100 |                           |    /                                               
101 |                           |    |                                               
102 |   /____               O __Si__/                                                
103 |  |/   \              |    |    |                                               
104 |  /     \  O\\     ____   _|    \                                               
105 | \  ___ |   ||     /   \_/  \__                                                 
106 |  \____/     \\_N \     /                                                       
107 |   \   |     _/    |   /                                                        
108 |  ___                                                                           
109 | /  \/                                                                          
110 | || |\                                                                          
111 |  \__                                                                           
112 |     \_O    ___                                                                 
113 |        |_N/  \   /                                                             
114 |       O|/ _\__\_/                                                              
115 |        / /_/ O_Si__                                                            
116 |          \_                                                                    
117 |            \O   |                                                              
118 |            /                                                                   
119 |        _Si/                                                                    
120 |         \ |_                                                                   
121 |    O____O | \                                                                  
122 | __|Si_ |  |     ____                                                           
123 |   /     \_\ __  |  |                                                           
124 |  /  |   /  \____N__|                                                           
125 |              \O|  |  /  \_                                                     
126 |                   \O/\___ \                                                    
127 |                    \   | ||                                                    
128 |               \_ /                                                             
129 |  ___            \                                                              
130 | /  \         O_Si_|                                                            
131 |  |__\    \   \  \ |                                                            
132 |    /\_O  \__\  _\                                                              
133 |       __N /  \/  \                                                             
134 |   \    / /   |                                                                 
135 |   |   O  \__/                                                                  
136 |    \   ___\                                                                    
137 |    ____/        
138 | 
139 | ...
140 | 24 molecules converted
141 | ```
142 | 
143 | You can save depictions in a more classic PNG file using Open Babel with either a single molecule:
144 | 
145 | ```console
146 | 
147 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "13586"[UID] | \
148 | > efetch -format docsum | \
149 | > xtract -pattern DocumentSummary -element IsomericSmiles CID | \
150 | > openbabel.obabel -ismi -O 13586.png
151 | 1 molecule converted
152 | ```
153 | 
154 | ![obabel01](/images/obabel01.png)
155 | 
156 | 
157 | or multiple molecules in a matrix:
158 | 
159 | ```console
160 | 
161 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "132427739"[UID] | \
162 | > elink -target pccompound -name pccompound_pccompound | \
163 | > efetch -format docsum | \
164 | > xtract -pattern DocumentSummary -element IsomericSmiles CID | \
165 | > openbabel.obabel -ismi -O 132427739_similar.png -xp 1400
166 | 24 molecules converted
167 | ```
168 | 
169 | ![obabel02](/images/obabel02.png)
170 | 
171 | _Tested with Open Babel v3.0.0 installed from Snap. I did receive a Font Configuration error when saving the PNG files; however, the conversion seemed to work fine._
172 | 
173 | ### gnuplot Data Plotting
174 | 
175 | [gnuplot](http://www.gnuplot.info/) is a command-line graphing program that allows plotting data from standard input. In gnuplot, there is an option called "dumb terminal" that creates plots using ASCII characters directly in the terminal window, which is convenient for initial analysis of compiled EDirect data. For example, here is some data related to the number of *J Cheminform* articles indexed in PubMed by publication date:
176 | 
177 | ```console
178 | 
179 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Cheminform[JOUR]" | \
180 | > efetch -format docsum | \
181 | > xtract -pattern DocumentSummary -element PubDate | \
182 | > cut -d " " -f 1 | \
183 | > sort-uniq-count-rank | \
184 | > sort -k2
185 | 22	2009
186 | 12	2010
187 | 54	2011
188 | 39	2012
189 | 52	2013
190 | 71	2014
191 | 78	2015
192 | 71	2016
193 | 67	2017
194 | 68	2018
195 | 56	2019
196 | ```
197 | We can pipe this data directly to gnuplot. In the below script, `set term dumb` is the gnuplot option to create an ASCII plot, `-` sets the data input to standard input instead of a file, `using 2:1` sets the second column (year) as the x-axis and the first column (count) as the y-axis, `with boxes` draws each count as a box (i.e., a bar chart), and `notitle` removes the plot legend:
198 | 
199 | ```console
200 | 
201 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Cheminform[JOUR]" | \
202 | > efetch -format docsum | \
203 | > xtract -pattern DocumentSummary -element PubDate | \
204 | > cut -d " " -f 1 | \
205 | > sort-uniq-count-rank | \
206 | > sort -k2 | \
207 | > gnuplot -e "set term dumb; plot '-' using 2:1 with boxes notitle"
208 | 
209 |                                                                                
210 |   80 +---------------------------------------------------------------------+   
211 |      |           +          +           +  *******  +          +           |   
212 |      |                                     *     *                         |   
213 |   70 |-+                             *******     *******    *******      +-|   
214 |      |                               *     *     *     ******     *        |   
215 |      |                               *     *     *     *    *     *        |   
216 |   60 |-+                             *     *     *     *    *     *      +-|   
217 |      |              ******           *     *     *     *    *     *******  |   
218 |      |              *    *     *******     *     *     *    *     *     *  |   
219 |   50 |-+            *    *     *     *     *     *     *    *     *     *+-|   
220 |      |              *    *     *     *     *     *     *    *     *     *  |   
221 |   40 |-+            *    *     *     *     *     *     *    *     *     *+-|   
222 |      |              *    *******     *     *     *     *    *     *     *  |   
223 |      |              *    *     *     *     *     *     *    *     *     *  |   
224 |   30 |-+            *    *     *     *     *     *     *    *     *     *+-|   
225 |      |              *    *     *     *     *     *     *    *     *     *  |   
226 |      |              *    *     *     *     *     *     *    *     *     *  |   
227 |   20 |-+*******     *    *     *     *     *     *     *    *     *     *+-|   
228 |      |  *     *     *    *     *     *     *     *     *    *     *     *  |   
229 |      |  *     *******    *  +  *     *  +  *     *  +  *    *  +  *     *  |   
230 |   10 +---------------------------------------------------------------------+   
231 |     2008        2010       2012        2014        2016       2018        2020 
232 | 
233 | ```
234 | 
235 | _Tested with gnuplot-x11 5.2.8._
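
The year-counting step in the pipelines above relies on `sort-uniq-count-rank`, a helper that ships with EDirect. As a rough sketch of what that step does, here is an approximation using only standard Unix tools. The function name `count_by_year` and the sample dates are hypothetical, and the output format (count, tab, year) is modeled on the output shown above rather than taken from the EDirect source.

```shell
# Hypothetical helper: keep the first space-delimited token of each line
# (the year in a PubDate such as "2019 Dec 5"), count occurrences, and
# print "count<TAB>year" sorted by year.
count_by_year() {
  cut -d ' ' -f 1 |
    sort |
    uniq -c |
    awk '{ printf "%s\t%s\n", $1, $2 }' |
    sort -k2,2n
}

# Example with made-up publication dates.
printf '%s\n' '2019 Dec 5' '2018 Jan' '2019' | count_by_year
```

The result can be piped to the same `gnuplot -e "set term dumb; plot '-' using 2:1 with boxes notitle"` command used above.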
236 | 
237 | 
238 | 


--------------------------------------------------------------------------------
/04_EDirect_PubChem_Recipes.md:
--------------------------------------------------------------------------------
  1 | # PubChem EDirect Recipes
  2 | 
  3 | **Notes**
  4 | 
  5 | > 1. `user@computer:~$` represents an example terminal prompt name. Actual command/argument input is after the `$`.
  6 | > 2. Replace `name@xx.edu` with your email address.
  7 | > 3. `\` followed by `>` on the next line represents continued terminal input. You will need to delete the `>` symbol in order to run the scripts as a copy/paste into terminal.
  8 | > 4. You should validate your own EDirect scripts and results as there may be unintentional mistakes in these recipes. A convenient method is to compare your EDirect results to the NCBI Web interface search results: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/).
  9 | 
 10 | ## PubChem EDirect
 11 | 
 12 | ### Search PubChem Compound via InChIKey and Retrieve Data
 13 | 
 14 | In the below script, we first query the PubChem Compound database (`pccompound`) for "NJTXJDYZPQNTSM-WMZOPIPTSA-N" in the InChIKey (`[IKEY]`) field. Next, the record is retrieved in XML document summary (`docsum`) format and several properties are extracted with the `xtract` function, including the IsomericSmiles, CID, InChIKey, and IUPACName.
 15 | 
 16 | ```console
 17 | 
 18 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "NJTXJDYZPQNTSM-WMZOPIPTSA-N"[IKEY] | \
 19 | > efetch -format docsum | \
 20 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey IUPACName
 21 | C[C@]12CCC(=O)C=C1CCC[C@@H]2OC(=O)C3=CC=CC=C3	11044292	NJTXJDYZPQNTSM-WMZOPIPTSA-N	[(1S,8aS)-8a-methyl-6-oxo-1,2,3,4,7,8-hexahydronaphthalen-1-yl] benzoate
 22 | ```
 23 | _tested on 2021.01.27, EDirect 14.4, total count was 1._
 24 | 
 25 | ### Search PubChem Compound with a list of CIDs and Retrieve Data
 26 | 
 27 | If we have a small list of PubChem Compound Identifiers (CIDs) and need to retrieve specific data for each CID, we can write a for loop directly in the terminal. Note that in the below Bash script, I added a sleep of one second within the loop in an effort to not overload the NCBI servers.
 28 | 
 29 | ```console
 30 | 
 31 | user@computer:~$ for myCID in \
 32 | >    "146021325" \
 33 | >    "11068043" \
 34 | >    "11615487" \
 35 | >    "10056179" \
 36 | >    "169731"
 37 | > do
 38 | >    esearch -email name@xx.edu -db pccompound -query "$myCID[UID]" |
 39 | >    efetch -format docsum |
 40 | >    xtract -pattern DocumentSummary -lbl "$myCID" -element IsomericSmiles InChIKey MolecularFormula MolecularWeight
 41 | >    sleep 1
 42 | > done
 43 | 146021325	CN1C(=CN=C1Cl)C(/C=C/C2=CC=CC=C2)(C3=CC=CC=C3)O	AKSFJXCUMHAJKP-OUKQBFOZSA-N	C19H17ClN2O	324.800
 44 | 11068043	CC(C)[Si](C(C)C)(C(C)C)OC(CCCC1=CCC=CC1)CC=C	JDKBJINLKCZQNX-UHFFFAOYSA-N	C22H40OSi	348.600
 45 | 11615487	CC1=CC(=C(C=C1)NC(=O)C(C)(C)C)OC	WMXHBZHMNCGLQQ-UHFFFAOYSA-N	C13H19NO2	221.290
 46 | 10056179	CC(=O)N1CN2C3=CC=CC=C3C(=C2C4=CC=CC=C41)C(C(=O)NCC[Se]C5=CC=CC=C5)O	MJWHOMSECISGAK-UHFFFAOYSA-N	C27H25N3O3Se	518.500
 47 | 169731	C1=CC=C2C(=C1)C=C(N2)CC#N	RORMSTAFXZRNGK-UHFFFAOYSA-N	C10H8N2	156.180
 48 | ```
 49 | _tested on 2021.01.27, EDirect 14.4, total count was 5 (as expected in the for loop)._
 50 | 
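For a longer list, typing each CID into the loop becomes tedious. The following is a minimal sketch, not one of the tested recipes, that reads CIDs from a text file instead. The file contents, the `RUN_EDIRECT` switch, and the variable names are assumptions made for illustration. Two details are deliberate: the query term is written as `"${cid}[UID]"` because zsh would otherwise read `$cid[UID]` as a subscript, and `esearch` is given `< /dev/null`, a common precaution in `while read` loops so that a command inside the loop cannot consume the remaining lines of the file.

```shell
# Hypothetical input: one CID per line; blank or non-numeric lines are skipped.
cid_file="$(mktemp)"
printf '%s\n' 146021325 11068043 "" "not-a-cid" 169731 > "$cid_file"

terms=""
while IFS= read -r cid; do
    case "$cid" in
        ''|*[!0-9]*) continue ;;   # keep only lines made up entirely of digits
    esac
    term="${cid}[UID]"             # braces keep [UID] literal in bash and zsh
    terms="${terms}${term} "
    # Assumption: network calls run only when RUN_EDIRECT=1 and EDirect is installed.
    if [ "${RUN_EDIRECT:-0}" = "1" ] && command -v esearch >/dev/null 2>&1; then
        esearch -email name@xx.edu -db pccompound -query "$term" < /dev/null |
        efetch -format docsum |
        xtract -pattern DocumentSummary -lbl "$cid" \
            -element IsomericSmiles InChIKey MolecularFormula MolecularWeight
        sleep 1                    # pause between requests, as in the loop above
    fi
done < "$cid_file"
rm -f "$cid_file"

echo "$terms"
```

Without `RUN_EDIRECT=1`, the sketch only prints the query terms it would have sent, which is a cheap way to check the input file before contacting NCBI.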
 51 | ### Retrieve Pre-Computed Linked Similar Compounds
 52 | 
 53 | In the below script, we use the `esearch` function to query the PubChem Compound database (`pccompound`) for CID 11044292 within the Compound ID field (`[uid]`). The `esearch` results are then piped to `elink` finding related PubChem Compounds via the Entrez link `pccompound_pccompound`.
 54 | 
 55 | ```console
 56 | 
 57 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "11044292"[UID] | \
 58 | > elink -target pccompound -name pccompound_pccompound | \
 59 | > efetch -format docsum | \
 60 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey IUPACName
 61 | ...
 62 | ...
 63 | CC(C1=CC=CC=C1)OC(=O)C2=CC=C(C=C2)C(C)(C)C	152679150	ZNDWBMXQOLFTJJ-UHFFFAOYSA-N	1-phenylethyl 4-tert-butylbenzoate
 64 | CCCCC1(CCC(C(C1)OC(=O)C2=CC=CC=C2)C(C)C)C	152242148	WDOHMQDXGOVKBZ-UHFFFAOYSA-N	(5-butyl-5-methyl-2-propan-2-ylcyclohexyl) benzoate
 65 | CCC1=CC=CC=C1C(=O)OC2C=CC(=O)CC2(C)C	150893175	KYLDSGZLUQKTGC-UHFFFAOYSA-N	(6,6-dimethyl-4-oxocyclohex-2-en-1-yl) 2-ethylbenzoate
 66 | CC1CCC(C(CC1=O)(C)C)OC(=O)C2=CC=CC=C2	150335011	GQMHTGLOWBLGSB-UHFFFAOYSA-N	(2,2,5-trimethyl-4-oxocycloheptyl) benzoate
 67 | ...
 68 | ...
 69 | ```
 70 | _tested on 2021.01.27, EDirect 14.4, total count was 238._
 71 | 
 72 | ### Find Compounds with Specific Attributes
 73 | 
 74 | There are a variety of methods to limit results and find compounds with specific attributes in PubChem Compound. The below script, for example, uses the `efilter` function to limit the `elink` similarity results to compounds with active assays using the query "pccompound_pcassay_active" in the filter (`[FILT]`) field:
 75 | 
 76 | ```console
 77 | 
 78 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "11044292"[UID] | \
 79 | > elink -target pccompound -name pccompound_pccompound | \
 80 | > efilter -query "pccompound_pcassay_active"[FILT] | \
 81 | > efetch -format docsum | \
 82 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey IUPACName
 83 | CC1(CC(=O)C=C(C1=O)C2=CC=C(C=C2)COC(=O)C3=CC=CC=C3)C	46904830	DSGHHJWVVKGFNI-UHFFFAOYSA-N	[4-(5,5-dimethyl-3,6-dioxocyclohexen-1-yl)phenyl]methyl benzoate
 84 | CC1=CC[C@H](/C(=C\[C@@H](C(CCC1)(C)C)OC(=O)C)/C)OC(=O)C2=CC=CC=C2	46886858	WZEJCPHMOSKQHH-FNXKFTHESA-N	[(1R,2Z,4S)-4-acetyloxy-2,5,5,9-tetramethylcycloundeca-2,9-dien-1-yl] benzoate
 85 | CC1=C2[C@H]([C@@H]([C@@]3(C=CC(=O)C(=C)[C@H]3C=C2CC1=O)C)OC(=O)C)OC(=O)C4=CC=CC=C4	44585423	DSOLMHLLLLRMOJ-BNWQNPBSSA-N	[(4R,5R,5aR,9aS)-5-acetyloxy-3,5a-dimethyl-9-methylidene-2,8-dioxo-1,4,5,9a-tetrahydrobenzo[g]azulen-4-yl] benzoate
 86 | CCOC(=O)C1=CC=CC(=C1)C2=CC(=O)CC(C2)(C)C	44143998	BBEWYQFSZQEXCH-UHFFFAOYSA-N	ethyl 3-(5,5-dimethyl-3-oxocyclohexen-1-yl)benzoate
 87 | CCC1=C(C(C(OC1=O)C2=CC=CC=C2)(C)C)OC(=O)C3=CC=CC=C3	2893657	FEWXNYDYEILDTL-UHFFFAOYSA-N	(5-ethyl-3,3-dimethyl-6-oxo-2-phenyl-2H-pyran-4-yl) benzoate
 88 | CC1=C(C(C(OC1=O)C2=CC=CC=C2)(C)C)OC(=O)C3=CC=CC=C3	569453	UXXMZHQXFIACTH-UHFFFAOYSA-N	(3,3,5-trimethyl-6-oxo-2-phenyl-2H-pyran-4-yl) benzoate
 89 | ```
 90 | 
 91 | _tested on 2021.01.27, EDirect 14.4, total count was 6._
 92 | 
 93 | Another filtering method could be to add a specific property attribute range, such as compounds containing 8 to 12 rotatable bonds (`[RBC]`):
 94 | 
 95 | ```console
 96 | 
 97 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "11044292"[UID] | \
 98 | > elink -target pccompound -name pccompound_pccompound | \
 99 | > efilter -query "8:12"[RBC] | \
100 | > efetch -format docsum | \
101 | > xtract -pattern DocumentSummary -element IsomericSmiles CID RotatableBondCount
102 | CCC(CC)(CCCCCOC(=O)C1=CC=CC=C1)C(=O)C2=CC=CC=C2	153964625	12
103 | CCCCCCCC1CCC(CC1)OC(=O)C2=CC=CC=C2	153717993	9
104 | CCCCCCC1CCC(CC1)OC(=O)C2=CC=CC=C2	153717992	8
105 | CC(C(=O)CCCC(C)(C)CCC(C)(C)C)OC(=O)C1=CC=CC=C1	153334776	11
106 | COC(=O)CCCCC[C@H](C1=CC=CC=C1)OC(=O)C2=CC=CC=C2	145778504	11
107 | CCC(C(CC(C)(C)C)OC(=O)C1=CC=CC=C1)OC(=O)C2=CC=CC=C2	142273534	10
108 | ...
109 | ...
110 | ```
111 | 
112 | _tested on 2021.01.27, EDirect 14.4, total count was 57._
113 | 
114 | It is also possible to query PubChem Compound directly for compounds with specific attributes (i.e., without the use of `efilter`). However, you will likely need to be very specific in order to retrieve a reasonable number of records. For example, in the below script, PubChem Compound was queried for compounds containing uranium in the element field (`[ELMT]`) and 3 to 5 defined chiral atoms in the AtomChiralDefCount field (`[ACDC]`):
115 | 
116 | 
117 | ```console
118 | 
119 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "(\"U\"[ELMT]) AND \"3\"[ACDC]:\"5\"[ACDC]" | \
120 | > efetch -format docsum | \
121 | > xtract -pattern DocumentSummary -element IsomericSmiles CID MolecularFormula AtomChiralDefCount
122 | C[C@@H]1CC[C@H]([C@@]2([C@]1(CCC(=C2)C)C)C)[C-]=C.O.[U]	154676185	C16H27OU-	4
123 | C[CH-]O[C@H]1[C@@H]([C@H]([C@@H](O[C@@H]1[CH2-])O)O)C.[U+2]	154572041	C9H16O4U	5
124 | [CH3-].CNCCO[C@@H]1CNC[C@@H](C1C2=CC=C(C=C2)O[C@H]3CCN(C3)C4=CC(=CC=C4)F)OCC5=CC6=C(C=C5)OCCN6CC[CH2-].[U+2]	154550507	C37H49FN4O4U	3
125 | C[C@H]1CC=C(CN1C)C2=CSC(=N2)SC3=C(N4[C-]([C@H]3C)[C@H](C4=O)[C@@H](C)O)C(=O)O.[U]	154536644	C20H24N3O4S2U-	4
126 | CC1CC2C3[C@H](C=C4C[C-](CC[C@@]4(C3CC[C@@]2(C15OCCO5)C)C)OC[CH2-])O.[U+2]	154528690	C24H36O4U	3
127 | COC1=C(C=C2C(=C1)C(=O)N3CC(=C)C[C@H]3[C@@H]([N-]2)O)OCCCCCOC4=C(C=C5C(=C4)N=C[C@@H]6CC(=C)CN6C5=O)OC.[U]	153695434	C33H37N4O7U-	3
128 | ...
129 | ...
130 | ```
131 | 
132 | _tested on 2021.01.27, EDirect 14.4, total count was 1938._
133 | 
134 | Note that I escaped (`\`) the internal quotes in the above query. Sometimes this is not necessary (in my experience it depends on the NCBI database). If you are unsure how the query is being interpreted, run the `esearch` function with the `-debug` option. You can then use `nquire` with the link output and extract out the parsed query:
135 | 
136 | 
137 | ```console
138 | 
139 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "(\"U\"[ELMT]) AND \"3\"[ACDC]:\"5\"[ACDC]" -debug
140 | ...
141 | ...
142 | user@computer:~$ nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ esearch.fcgi -retmax 0 -usehistory y -db pccompound -term "(\"U\"[ELMT]) AND \"3\"[ACDC]:\"5\"[ACDC]" | \
143 | > xtract -pattern eSearchResult -element QueryTranslation
144 | "U"[ELMT] AND "3"[AtomChiralDefCount] : "5"[AtomChiralDefCount]
145 | ```
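
Part of the reason escaping matters is that the shell processes quotes before EDirect ever sees the query. The short sketch below is an illustration only: it uses `printf`, does not contact NCBI, and the variable names are made up for this example. It prints the exact strings that three quoting styles would pass as the `-query` argument.

```shell
# Inner quotes NOT escaped: the shell removes them before the command runs.
unescaped=$(printf '%s' "11044292"[UID])

# Inner quotes escaped: they survive and reach the command.
escaped=$(printf '%s' "(\"U\"[ELMT]) AND \"3\"[ACDC]:\"5\"[ACDC]")

# Single quotes around the whole query are an equivalent alternative.
single='("U"[ELMT]) AND "3"[ACDC]:"5"[ACDC]'

printf '%s\n' "$unescaped" "$escaped" "$single"
```

The first string arrives as `11044292[UID]` with no quotes at all, while the other two arrive identically with their inner double quotes intact. The unquoted `[UID]` is also a glob pattern to the shell, which is one more reason to keep the whole query inside quotes.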
146 | 
147 | ### Find Number of Compounds by Create Date
148 | 
149 | We can use the Create Date field (`[CDAT]`) in PubChem Compound to search for compound records created on a specific date or, as in the script below, within a date range. The `esearch` results are then piped into `efetch` to retrieve the XML docsum compound records. Next, the `xtract` function is used to extract the CreateDate. The extracted data is then piped into the EDirect alias function `sort-uniq-count-rank`, which counts the unique values and sorts them by highest frequency. Finally, I added an additional `sort` command to sort by date (`-k2,2` for the second column) instead of by number of compounds.
150 | 
151 | ```console
152 | 
153 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "2020/05/01"[CDAT]:"2020/05/31"[CDAT] | \
154 | > efetch -format docsum | \
155 | > xtract -pattern DocumentSummary -element CreateDate | \
156 | > sort-uniq-count-rank | \
157 | > sort -k2,2
158 | 448	2020/05/01 00:00
159 | 20	2020/05/02 00:00
160 | 45	2020/05/04 00:00
161 | 3	2020/05/05 00:00
162 | 67	2020/05/06 00:00
163 | 32	2020/05/07 00:00
164 | 97	2020/05/08 00:00
165 | 7827	2020/05/11 00:00
166 | 42	2020/05/12 00:00
167 | 573	2020/05/13 00:00
168 | 67	2020/05/14 00:00
169 | 75	2020/05/15 00:00
170 | 136	2020/05/16 00:00
171 | 69	2020/05/18 00:00
172 | 52	2020/05/19 00:00
173 | 66	2020/05/20 00:00
174 | 53	2020/05/21 00:00
175 | 790	2020/05/22 00:00
176 | 36	2020/05/23 00:00
177 | 2	2020/05/24 00:00
178 | 5	2020/05/25 00:00
179 | 530	2020/05/26 00:00
180 | 63	2020/05/27 00:00
181 | 169	2020/05/28 00:00
182 | 9432	2020/05/29 00:00
183 | 26	2020/05/30 00:00
184 | 2	2020/05/31 00:00
185 | ```
186 | _tested on 2021.01.27, EDirect 14.4._
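
To see what the last two pipeline stages do without contacting NCBI, here is a small offline sketch. The `count_rank` function is a rough stand-in for `sort-uniq-count-rank` built from standard tools (an approximation, not the EDirect implementation), and the sample dates are made up.

```shell
tab="$(printf '\t')"

# Rough stand-in for sort-uniq-count-rank using standard tools (an approximation,
# not the EDirect implementation): count duplicates, then rank by count.
count_rank() {
    sort | uniq -c |
    awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n "\t" $0 }' |
    sort -t "$tab" -k1,1nr -k2,2
}

# Made-up sample of extracted CreateDate values.
dates='2020/05/04 00:00
2020/05/02 00:00
2020/05/04 00:00
2020/05/01 00:00
2020/05/02 00:00
2020/05/04 00:00'

by_count=$(printf '%s\n' "$dates" | count_rank)
by_date=$(printf '%s\n' "$by_count" | sort -t "$tab" -k2,2)

printf 'ranked by count:\n%s\n' "$by_count"
printf 'sorted by date:\n%s\n' "$by_date"
```

The first listing puts the most frequent date on top; the second re-sorts the same lines on the date column, which is the role of the trailing `sort -k2,2` in the script above.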
187 | 
188 | If we want to obtain the number of compounds in PubChem by create date over a longer period of time (e.g., several months to years), it probably does not make sense to use `efetch`, as the number of compounds will be hundreds of thousands or even millions. Trying to download all of the docsums for this many records likely won't work. As an alternative, we can use `esearch` in a for loop and extract the Count value from the `esearch` ENTREZ_DIRECT query XML summary. For example, if we wanted the number of compounds created in PubChem for 2019 by month:
189 | 
190 | ```console
191 | 
192 | user@computer:~$ for date in \
193 | >   "2019/01" \
194 | >   "2019/02" \
195 | >   "2019/03" \
196 | >   "2019/04" \
197 | >   "2019/05" \
198 | >   "2019/06" \
199 | >   "2019/07" \
200 | >   "2019/08" \
201 | >   "2019/09" \
202 | >   "2019/10" \
203 | >   "2019/11" \
204 | >   "2019/12"
205 | > do
206 | >   esearch -email name@xx.edu -db pccompound -query "$date[CDAT]" |
207 | >   xtract -pattern ENTREZ_DIRECT -lbl "$date" -element Count
208 | >   sleep 1 
209 | > done
210 | 2019/01	1843612
211 | 2019/02	7970
212 | 2019/03	219313
213 | 2019/04	469125
214 | 2019/05	324068
215 | 2019/06	64691
216 | 2019/07	302326
217 | 2019/08	154938
218 | 2019/09	119817
219 | 2019/10	236148
220 | 2019/11	308727
221 | 2019/12	5444411
222 | ```
223 | _tested on 2021.01.27, total count was 12 (as expected in the for loop)._
224 | 
225 | In the above for loop bash script, I added a sleep of one second between each `esearch` query in an effort to not overload the NCBI servers.
226 | 
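If the month list gets long (several years, say), it can be generated rather than typed. This is a minimal sketch under stated assumptions: the `year` variable and the loop structure are illustrative, and the sketch only prints the `[CDAT]` terms; it does not call `esearch`.

```shell
year=2019   # assumption: any four-digit year can be substituted
dates=""
month=1
while [ "$month" -le 12 ]; do
    # %02d zero-pads the month so the value matches the YYYY/MM form used above.
    dates="${dates}$(printf '%s/%02d' "$year" "$month") "
    month=$((month + 1))
done

# Each term could be passed to: esearch -db pccompound -query "$term"
for date in $dates; do
    term="${date}[CDAT]"
    echo "$term"
done
```

Building the term inside the loop as `"${date}[CDAT]"` keeps the square brackets quoted, so the shell never treats them as a glob pattern.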
227 | ### Find Related PubChem Substances (same)
228 | 
229 | To find the number of related PubChem substances for a PubChem compound, we can use `elink` with Entrez link `pccompound_pcsubstance_same`:
230 | 
231 | ```console
232 | 
233 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "14333"[UID] | \
234 | > elink -target pcsubstance -name pccompound_pcsubstance_same | \
235 | > xtract -pattern ENTREZ_DIRECT -element Count
236 | 51
237 | ```
238 | 
239 | And then to retrieve information about the PubChem substances, we can pipe these results into `efetch` and `xtract`, to extract out specific information such as the SID, CurrentSourceName, SourceID, and DepositDate:
240 | 
241 | ```console
242 | 
243 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "14333"[UID] | \
244 | > elink -target pcsubstance -name pccompound_pcsubstance_same | \
245 | > efetch -format docsum | \
246 | > xtract -pattern DocumentSummary -element SID CurrentSourceName SourceID DepositDate
247 | ...
248 | 439452693	THE BioTek	bt-308998	2020/12/31 00:00
249 | 438657242	Alfa Chemistry	ACM1132394	2020/12/09 00:00
250 | 438538915	3WAY PHARM INC	SWOT-0105728	2020/12/08 00:00
251 | 435642079	Chem-Space.com Database	CSSB00032005459	2020/11/21 00:00
252 | 410573132	Google Patents	15237363	2020/08/12 00:00
253 | 404911410	The University of Alabama Libraries	UALIB-1927	2020/03/21 00:00
254 | 403383863	PATENTSCOPE (WIPO)	ORQWTLCYLDRDHK-UHFFFAOYSA-N	2020/01/24 00:00
255 | 387135315	NORMAN Suspect List Exchange	ORQWTLCYLDRDHK-UHFFFAOYSA-N	2019/11/22 00:00
256 | 386279116	Wiley	140582	2019/10/23 00:00
257 | ...
258 | ...
259 | ```
260 | _tested on 2021.01.27, total count was 51._
261 | 
262 | 


--------------------------------------------------------------------------------
/05_EDirect_PubMed_Recipes.md:
--------------------------------------------------------------------------------
  1 | # PubMed EDirect Recipes
  2 | 
  3 | **Notes**
  4 | 
  5 | > 1. `user@computer:~$` represents an example terminal prompt name. Actual command/argument input is after the `$`.
  6 | > 2. Replace `name@xx.edu` with your email address.
  7 | > 3. `\` followed by `>` on the next line represents continued terminal input. You will need to delete the `>` symbol in order to run the scripts as a copy/paste into terminal.
  8 | > 4. You should validate your own EDirect scripts and results as there may be unintentional mistakes in these recipes. A convenient method is to compare your EDirect results to the NCBI Web interface search results: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/).
  9 | 
 10 | ## PubMed EDirect
 11 | 
 12 | ### Search PubMed by Keyword and/or MeSH and Retrieve References
 13 | 
 14 | We can use the EDirect function `esearch` to query PubMed. However, before trying to retrieve any of the results with `efetch`, it is a good idea to check that the count range is manageable (e.g., on the order of several thousand). In addition, see the [EDirect Query Translation Instructions](https://github.com/vfscalfani/EDirectChemInfo/blob/master/01_EDirect_Intro.md#edirect-query-translation-via-debug-flag) for how to use the `-debug` option to view how your query is interpreted in PubMed.
 15 | 
 16 | ```console
 17 | 
 18 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "hydrogel-based drug delivery"
 19 | <ENTREZ_DIRECT>
 20 |   <Db>pubmed</Db>
 21 |   <WebEnv>MCID...</WebEnv>
 22 |   <QueryKey>1</QueryKey>
 23 |   <Count>436</Count>
 24 |   <Step>1</Step>
 25 |   <Email>name@xx.edu</Email>
 26 | </ENTREZ_DIRECT>
 27 | ```
 28 | 
 29 | After deciding whether the `esearch` query is appropriate, we can start to pipe the `esearch` results into other EDirect functions. For example, the below script first uses `esearch` to query PubMed for "hydrogel-based drug delivery", and then these results are piped (`|`) into `efetch` to retrieve the records in XML format. The `efetch` results are then piped to the `xtract` function, where several bibliographic elements of the PubMed XML records are extracted into a table:
 30 | 
 31 | ```console
 32 | 
 33 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "hydrogel-based drug delivery" | \
 34 | > efetch -format xml | \
 35 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
 36 | > Author/Initials ArticleTitle ISOAbbreviation PubDate/Year Volume Issue MedlinePgn
 37 | 33424262	El-Masry	SM	Hydrogel-based matrices for controlled drug delivery of etamsylate: Prediction of in-vivo plasma profiles.	Saudi Pharm J	2020	28	12	1704-1718
 38 | 33398321	Chen	W	Magnetically actuated intelligent hydrogel-based child-parent microrobots for targeted drug delivery.	J Mater Chem B	2021
 39 | 33396629	Dehshahri	A	New Horizons in Hydrogels for Methotrexate Delivery.	Gels	2020	7	1
 40 | 33387892	Amiri	M	Hydrogel beads-based nanocomposites in novel drug delivery platforms: Recent trends and developments.	Adv Colloid Interface Sci	2020	288	102316
 41 | 33378390	Kloepping	KC	Triphenylphosphonium derivatives disrupt metabolism and inhibit melanoma growth in vivo when delivered via a thermosensitive hydrogel.	PLoS One	2020	15	12	e0244540
 42 | 33359482	Agarwal	P	Structural characterization and developability assessment of sustained release hydrogels for rapid implementation during preclinical studies.	Eur J Pharm Sci	2021	158	105689
 43 | ...
 44 | ...
 45 | ```
 46 | 
 47 | _tested on 2021.01.27, EDirect 14.4, total count was 436._
 48 | 
 49 | 
 50 | Note that if we want to extract out the DOIs, we can use the `xtract` `-block` option like this:
 51 | 
 52 | ```console
 53 | 
 54 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "hydrogel-based drug delivery" | \
 55 | > efetch -format xml | \
 56 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
 57 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn \
 58 | > -block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId
 59 | 33424262	El-Masry	SM	Saudi Pharm J	2020	28	12	1704-1718	https://doi.org/10.1016%2Fj.jsps.2020.10.016
 60 | 33398321	Chen	W	J Mater Chem B	2021	https://doi.org/10.1039%2Fd0tb02384a
 61 | 33396629	Dehshahri	A	Gels	2020	7	1	https://doi.org/10.3390%2Fgels7010002
 62 | 33387892	Amiri	M	Adv Colloid Interface Sci	2020	288	102316	https://doi.org/10.1016%2Fj.cis.2020.102316
 63 | 33378390	Kloepping	KC	PLoS One	2020	15	12	e0244540	https://doi.org/10.1371%2Fjournal.pone.0244540
 64 | 33359482	Agarwal	P	Eur J Pharm Sci	2021	158	105689	https://doi.org/10.1016%2Fj.ejps.2020.105689
 65 | ...
 66 | ...
 67 | ```
 68 | _tested on 2021.01.27, EDirect 14.4, total count was 436._
 69 | 
 70 | 
 71 | There is a lot going on with the last line of code that extracts out the DOIs: `-block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId`. Let's look at part of a PubMed XML file to help interpret what is going on here:
 72 | 
 73 | ```console
 74 | ...
 75 | ...
 76 | <ArticleIdList>
 77 |         <ArticleId IdType="pubmed">17630804</ArticleId>
 78 |         <ArticleId IdType="doi">10.1021/jo071035l</ArticleId>
 79 | </ArticleIdList>
 80 | ...
 81 | ...
 82 | ```
 83 | 
 84 | The `-block` option limits the extraction to a particular section of the XML, in this case the `ArticleId` tags. The `@` refers to an attribute of an element, so `-if ArticleId@IdType -equals doi` keeps only the `ArticleId` elements whose `IdType` attribute is `doi`. Finally, `-doi` is an `xtract` string option that prefixes https://doi.org/ to the extracted DOI (in the output above, the `/` inside each DOI appears URL-encoded as `%2F`). There is a more thorough explanation of `-block` and extracting out the DOIs with the `-block` option in the [NLM Insider's Guide to Accessing NLM Data Part 4](https://dataguide.nlm.nih.gov/classes/edirect-for-pubmed/samplecode4.html#output-a-list-of-pmids-and-corresponding-dois) and [Entrez Programming Utilities Help Manual](https://www.ncbi.nlm.nih.gov/books/NBK179288/).
 85 | 
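The DOI links in the output above contain `%2F` where a DOI normally has `/`. If plain DOIs are preferred, one option is a small `sed` post-processing step. The sketch below is an illustration, not a general URL decoder: the `decode_doi` function name is made up, and it handles only the three escapes that show up in the sample outputs of these recipes.

```shell
# Sample value copied from the output above (URL-encoded by -doi).
encoded='https://doi.org/10.1016%2Fj.jsps.2020.10.016'

# Decode only the escapes that appear in these recipes: %2F "/", %28 "(", %29 ")".
decode_doi() {
    sed -e 's|%2F|/|g' -e 's|%28|(|g' -e 's|%29|)|g'
}

decoded=$(printf '%s\n' "$encoded" | decode_doi)
echo "$decoded"
```

In a pipeline, `| decode_doi` could be appended after the `xtract` command. Note that it rewrites these escapes anywhere on the line, not only in the DOI column.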
 86 | 
 87 | Similarly to the above script, we can specify particular fields to query within PubMed. The below script searches for "ionic liquids" in the MeSH term field (`[MESH]`) and "imidazolium" in all fields. Note that the internal quotes are escaped (`\`), which is sometimes necessary for the query to be interpreted correctly when using phrases.
 88 | 
 89 | 
 90 | ```console
 91 | 
 92 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "\"ionic liquids\"[MESH] AND imidazolium" | \
 93 | > efetch -format xml | \
 94 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
 95 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn \
 96 | > -block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId
 97 | 33396149	Hu	LX	Ecotoxicol Environ Saf	2021	208	111629	https://doi.org/10.1016%2Fj.ecoenv.2020.111629
 98 | 33346267	Kaur	M	Phys Chem Chem Phys	2021	23	1	320-328	https://doi.org/10.1039%2Fd0cp04513f
 99 | 33253998	Tashakkori	P	J Chromatogr A	2021	1635	461741	https://doi.org/10.1016%2Fj.chroma.2020.461741
100 | 33142384	Ren	YM	Zhonghua Lao Dong Wei Sheng Zhi Ye Bing Za Zhi	2020	38	10	767-769	https://doi.org/10.3760%2Fcma.j.cn121094-20191010-00483
101 | 33135708	Kumar	S	Phys Chem Chem Phys	2020	22	43	25255-25263	https://doi.org/10.1039%2Fd0cp04014b
102 | 32822985	Zuo	L	J Chromatogr A	2020	1628	461446	https://doi.org/10.1016%2Fj.chroma.2020.461446
103 | 32711338	Zunita	M	Bioresour Technol	2020	315	123864	https://doi.org/10.1016%2Fj.biortech.2020.123864
104 | ...
105 | ...
106 | ```
107 | _tested on 2021.01.27, EDirect 14.4, total count was 1000._
108 | 
109 | 
110 | ### Calculate the Most Frequent Journal Titles For a PubMed Search
111 | 
112 | The below script uses `esearch` to query PubMed for "Artificial Intelligence" in the `[MESH]` field and "drug discovery" in the `[ALL]` field. The records are then retrieved in XML format using the `efetch` function, followed by extracting the journal names (`ISOAbbreviation`) using `xtract`. The `xtract` results are then piped to the EDirect alias function `sort-uniq-count-rank`, which counts the unique values and sorts them by highest frequency:
113 | 
114 | ```console
115 | 
116 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "\"Artificial Intelligence\"[MESH] AND \"drug discovery\"[ALL]" | \
117 | > efetch -format xml | \
118 | > xtract -pattern PubmedArticle -element ISOAbbreviation | \
119 | > sort-uniq-count-rank
120 | 169	J Chem Inf Model
121 | 53	BMC Bioinformatics
122 | 49	PLoS One
123 | 40	Bioinformatics
124 | 39	Methods Mol Biol
125 | 33	Mol Pharm
126 | 32	Molecules
127 | 29	Sci Rep
128 | 28	Drug Discov Today
129 | 28	J Comput Aided Mol Des
130 | 24	J Med Chem
131 | 23	Expert Opin Drug Discov
132 | 23	Int J Mol Sci
133 | 19	Curr Top Med Chem
134 | 18	Mol Inform
135 | 17	Future Med Chem
136 | 16	Nucleic Acids Res
137 | 15	Nature
138 | 15	PLoS Comput Biol
139 | 14	IEEE/ACM Trans Comput Biol Bioinform
140 | ...
141 | ...
142 | ```
143 | _tested on 2021.01.27, EDirect 14.4._
144 | 
145 | ### Calculate The Frequency of Author Publications for a University Department in PubMed
146 | 
147 | The below script uses `esearch` to query PubMed for ("university of alabama" AND tuscaloosa) in the affiliation field (`[AFFL]`). Tuscaloosa was added to limit the number of retrieved records associated with The University of Alabama at Birmingham and The University of Alabama at Huntsville. Another approach could have been to use the NOT operator: `"(university of alabama[AFFL]) NOT (birmingham[AFFL] OR huntsville[AFFL])"`. However, the latter approach may eliminate any collaborative articles with these institutions (affiliation searches are challenging!). Next, the results were retrieved as XML using `efetch`, followed by piping these results to `xtract` to extract out the publication year (`PubDate/Year`) and sort by frequency with `sort-uniq-count-rank`. Note that a conditional statement was used in the `xtract` pattern to only extract results from articles if the affiliation contains both `chemistry` and `tuscaloosa`. The thought here was that this would limit the results (mostly) to author publications from The University of Alabama (Tuscaloosa) Department of Chemistry:
148 | 
149 | ```console
150 | 
151 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "(university of alabama[AFFL] AND tuscaloosa[AFFL])" | \
152 | > efetch -format xml | \
153 | > xtract -pattern PubmedArticle -if Affiliation -contains chemistry -and Affiliation -contains tuscaloosa -element PubDate/Year | \
154 | > sort-uniq-count-rank
155 | 65	2015
156 | 64	2020
157 | 59	2017
158 | 53	2018
159 | 49	2019
160 | 45	2016
161 | 41	2014
162 | 35	2013
163 | 30	2012
164 | 28	2008
165 | 26	2006
166 | 23	2007
167 | 23	2010
168 | 20	2004
169 | 19	2003
170 | 18	2009
171 | 17	1999
172 | 17	2001
173 | ...
174 | ...
175 | ```
176 | 
177 | _tested on 2021.01.27, EDirect 14.4._
178 | 
179 | If we want publication counts for individual authors rather than total publications by year, we can change the `xtract` pattern:
180 | 
181 | ```console
182 | 
183 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "(university of alabama[AFFL] AND tuscaloosa[AFFL])" | \
184 | > efetch -format xml | \
185 | > xtract -pattern Author -if Affiliation -contains chemistry -and Affiliation -contains tuscaloosa -element LastName Initials | \
186 | > sort-uniq-count-rank
187 | 108	Dixon	DA
188 | 53	Vasiliu	M
189 | 34	Vincent	JB
190 | 32	Rogers	RD
191 | 22	Bowman	MK
192 | 20	Fang	Z
193 | 14	Grant	DJ
194 | 14	Thanthiriwatte	KS
195 | 13	Cassady	CJ
196 | 12	Frantom	PA
197 | 11	Chen	M
198 | 11	Kelley	SP
199 | 11	Shamshina	JL
200 | 10	Kispert	LD
201 | 10	Metzger	RM
202 | 10	Papish	ET
203 | 9	Gerlach	DL
204 | 9	Li	S
205 | 9	Timkovich	R
206 | 8	Matus	MH
207 | ...
208 | ...
209 | ```
210 | _tested on 2021.01.27, EDirect 14.4._
211 | 
212 | 
213 | Let's take a closer look at the conditional `xtract` pattern specifying to extract data only if the affiliation contains chemistry and tuscaloosa:
214 | 
215 | ```console
216 | 
217 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "(university of alabama[AFFL] AND tuscaloosa[AFFL])" | \
218 | > efetch -format xml | \
219 | > xtract -pattern Author -if Affiliation -contains chemistry -and Affiliation -contains tuscaloosa -element LastName Initials Affiliation
220 | Rowe	SJ	Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA.
221 | Mecaskey	RJ	Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA.
222 | Nasef	M	Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA.
223 | Talton	RC	Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA.
224 | Sharkey	RE	Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA.
225 | Halliday	JC	Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA.
226 | ...
227 | ...
228 | ```
229 | _tested on 2021.01.27, EDirect 14.4._
230 | 
231 | 
232 | With a quick look at the ~1000 results, it seemed like we extracted the intended data. However, I did notice some false positive results. One example was an article from The University of Alabama (Tuscaloosa) Department of Biological Sciences with an external collaborator having "Chemistry" in the institution name. Other errors could come from what we unintentionally excluded, such as any records that do not have Tuscaloosa in the affiliation field (i.e., only a partial address or zip code). These types of affiliation searches are tricky, so test often and think through the results carefully.
233 | 
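One possible way to reduce such false positives is to post-filter the `LastName Initials Affiliation` table so that both words must occur within the same affiliation string. The sketch below is an illustration under stated assumptions, not a tested recipe: it runs offline on a two-line sample (the second line is made up to mimic the false positive described above) and assumes that multiple affiliations for one author arrive as extra tab-separated columns.

```shell
tab="$(printf '\t')"

# Line 1 is taken from the output above; line 2 is a made-up author with two affiliations.
sample="Rowe${tab}SJ${tab}Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA.
Doe${tab}J${tab}Department of Biological Sciences, University of Alabama, Tuscaloosa, AL, USA.${tab}Institute of Chemistry, Example University."

# Keep an author only if a single affiliation column contains both words.
strict=$(printf '%s\n' "$sample" | awk -F '\t' '{
    for (i = 3; i <= NF; i++) {
        a = tolower($i)
        if (index(a, "chemistry") && index(a, "tuscaloosa")) {
            print $1 "\t" $2
            break
        }
    }
}')

echo "$strict"
```

Only the first author passes, because the made-up second author has "Tuscaloosa" and "Chemistry" in different affiliations. A stricter filter like this trades false positives for a higher chance of missing valid records, so the results still need a manual check.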
234 | 
235 | ### Retrieve Cites and Cited References in PubMed
236 | 
237 | The `elink` function can retrieve associated cites and cited references for PubMed records. Cites are the references listed in the article (i.e., its bibliography), and cited are the articles that reference it. Not all PubMed articles have associated citation reference data. The available reference data are from the [NIH Open Citation Collection Dataset](https://pubmed.ncbi.nlm.nih.gov/31600197/).
238 | 
239 | To retrieve the number of cites for a PubMed article, we can use the `elink` function, followed by `xtract` to extract out the Count element:
240 | 
241 | ```console
242 | 
243 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "29978703[PMID]" | \
244 | > elink -cites | \
245 | > xtract -pattern ENTREZ_DIRECT -element Count
246 | 11
247 | ```
248 | _tested on 2021.01.27, EDirect 14.4._
249 | 
250 | 
251 | Add `efetch` to your script if you want to retrieve the records:
252 | 
253 | ```console
254 | 
255 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "29978703[PMID]" | \
256 | > elink -cites | \
257 | > efetch -format xml | \
258 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
259 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn \
260 | > -block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId
261 | 29382051	Stefanachi	A	Molecules	2018	23	2	https://doi.org/10.3390%2Fmolecules23020250
262 | 27709885	Stempel	E	Acc Chem Res	2016	49	11	2390-2402	https://doi.org/10.1021%2Facs.accounts.6b00265
263 | 26661053	James	MJ	Chemistry	2016	22	9	2856-81	https://doi.org/10.1002%2Fchem.201503835
264 | 26313158	Liu	BY	Org Lett	2015	17	17	4380-3	https://doi.org/10.1021%2Facs.orglett.5b02230
265 | 22969063	Han	X	Angew Chem Int Ed Engl	2012	51	41	10390-3	https://doi.org/10.1002%2Fanie.201205238
266 | 18620434	Martin	R	Acc Chem Res	2008	41	11	1461-73	https://doi.org/10.1021%2Far800036s
267 | ...
268 | ...
269 | ```
270 | 
271 | _tested on 2021.01.27, EDirect 14.4._
272 | 
273 | Getting the cited records only requires changing `-cites` to `-cited`:
274 | 
275 | ```console
276 | 
277 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "29978703[PMID]" | \
278 | > elink -cited | \
279 | > efetch -format xml | \
280 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
281 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn \
282 | > -block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId
283 | 32537619	Fernandes	RA	Chem Commun (Camb)	2020	56	61	8569-8590	https://doi.org/10.1039%2Fd0cc02659j
284 | 32317969	Lautié	E	Front Pharmacol	2020	11	397	https://doi.org/10.3389%2Ffphar.2020.00397
285 | 30707497	Ivanova	OA	Chem Rec	2019	https://doi.org/10.1002%2Ftcr.201800166
286 | 30259622	Tymann	D	Angew Chem Int Ed Engl	2018	57	47	15553-15557	https://doi.org/10.1002%2Fanie.201808578
287 | ```
288 | _tested on 2021.01.27, EDirect 14.4._
289 | 
290 | We can answer some interesting questions with the NIH Open Citation Collection Data. For example, I noticed that the PubMed XML records for articles in *J Cheminform* contain a reference list for articles in PubMed. So, theoretically, if we query PubMed for *J Cheminform*, extract out all of the references, and sort these by frequency, we should get the most cited references in *J Cheminform* article bibliographies (caveat: in the available PubMed citation data).
291 | 
292 | In the below script, the `xtract` pattern creates a new line for each extracted reference citation PMID from the ArticleId with pubmed attribute field:
293 | 
294 | ```console
295 | 
296 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Cheminform[JOUR]" | \
297 | > efetch -format xml | \
298 | > xtract -pattern Reference -if ArticleId@IdType -equals pubmed -element ArticleId | \
299 | > sort-uniq-count-rank | \
300 | > head -n 20
301 | 90	20426451
302 | 83	21982300
303 | 80	21948594
304 | 65	12653513
305 | 47	11259830
306 | 40	10592235
307 | 38	16796559
308 | 38	27899562
309 | 37	26400175
310 | 36	8709122
311 | 33	17154509
312 | 33	19498078
313 | 30	16381955
314 | 29	21059682
315 | 29	23343401
316 | 28	15667143
317 | 28	24214965
318 | 27	17932057
319 | 27	21425294
320 | 27	22587354
321 | ```
322 | _tested on 2021.01.27, EDirect 14.4._
323 | 
324 | Note that when quickly viewing all of the sorted results (~10,000 lines), I did see roughly 100 entries with two PMIDs per line or a DOI and a PMID. Since we specifically defined the pubmed IdType attribute, it is not yet clear to me why there would be some extra data in there. Perhaps it is a mistake or inconsistency in the *J Cheminform* PubMed XML records.
325 | 
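One way to set the irregular lines aside is to keep only rows that have exactly two columns with an all-digit second column. The following is a minimal offline sketch of that idea; the last two sample lines are made up to imitate the irregular entries, and the variable names are illustrative.

```shell
tab="$(printf '\t')"

# First two lines are from the output above; the last two are made-up examples
# of the irregular entries (two PMIDs, or a DOI and a PMID, on one line).
ranked="90${tab}20426451
83${tab}21982300
1${tab}11111111${tab}22222222
1${tab}10.1000/example${tab}33333333"

# Keep lines with exactly two columns where the second is all digits.
clean=$(printf '%s\n' "$ranked" |
    awk -F '\t' 'NF == 2 && $2 ~ /^[0-9]+$/ { print $2 }')

echo "$clean"
```

The surviving PMIDs could then be fed to a loop like the one that follows instead of being typed by hand. Dropping the irregular lines also drops their counts, so the ranking is slightly conservative.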
326 | 
327 | We can take a quick look at the top 10 cited references using a for loop:
328 | 
329 | ```console
330 | user@computer:~$ for refs in \
331 | >     "20426451" \
332 | >     "21982300" \
333 | >     "21948594" \
334 | >     "12653513" \
335 | >     "11259830" \
336 | >     "10592235" \
337 | >     "16796559" \
338 | >     "27899562" \
339 | >     "26400175" \
340 | >     "8709122"
341 | > do
342 | >      esearch -email name@xx.edu -db pubmed -query "$refs[PMID]" |
343 | >      efetch -format xml |
344 | >      xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
345 | >      Author/Initials ArticleTitle ISOAbbreviation PubDate/Year Volume Issue MedlinePgn \
346 | >      -block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId    
347 | >      sleep 1 
348 | > done
349 | 20426451	Rogers	D	Extended-connectivity fingerprints.	J Chem Inf Model	2010	50	5	742-54	https://doi.org/10.1021%2Fci100050t
350 | 21982300	O'Boyle	NM	Open Babel: An open chemical toolbox.	J Cheminform	2011	3	33	https://doi.org/10.1186%2F1758-2946-3-33
351 | 21948594	Gaulton	A	ChEMBL: a large-scale bioactivity database for drug discovery.	Nucleic Acids Res	2012	40	Database issue	D1100-7	https://doi.org/10.1093%2Fnar%2Fgkr777
352 | 12653513	Steinbeck	C	The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics.	J Chem Inf Comput Sci	43	2	493-500	https://doi.org/10.1021%2Fci025584y
353 | 11259830	Lipinski	CA	Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings.	Adv Drug Deliv Rev	2001	46	1-3	3-26	https://doi.org/10.1016%2Fs0169-409x%2800%2900129-0
354 | 10592235	Berman	HM	The Protein Data Bank.	Nucleic Acids Res	2000	28	1	235-42	https://doi.org/10.1093%2Fnar%2F28.1.235
355 | 16796559	Steinbeck	C	Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics.	Curr Pharm Des	2006	12	17	2111-20	https://doi.org/10.2174%2F138161206777585274
356 | 27899562	Gaulton	A	The ChEMBL database in 2017.	Nucleic Acids Res	2017	45	D1	D945-D954	https://doi.org/10.1093%2Fnar%2Fgkw1074
357 | 26400175	Kim	S	PubChem Substance and Compound databases.	Nucleic Acids Res	2016	44	D1	D1202-13	https://doi.org/10.1093%2Fnar%2Fgkv951
358 | 8709122	Bemis	GW	The properties of known drugs. 1. Molecular frameworks.	J Med Chem	1996	39	15	2887-93	https://doi.org/10.1021%2Fjm9602928
359 | ```
360 | 
361 | Another interesting question: which journal is cited most often in *J Cheminform* articles (in the available PubMed citation data)? In the below script, we take a similar approach to the one above, but instead of extracting the PMIDs, we extract the Citation element. The line `cut -d "." -f 1` keeps only the text before the first period, which removes any data after the journal abbreviation (e.g., "Drug Discov Today. 2006 Dec;11(23-24):1046-53" becomes "Drug Discov Today").
362 | ```console
363 | 
364 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Cheminform[JOUR]" | \
365 | > efetch -format xml | \
366 | > xtract -pattern Reference -element Citation | \
367 | > cut -d "." -f 1 | \
368 | > sort-uniq-count-rank
369 | 2274	J Chem Inf Model
370 | 1268	J Cheminform
371 | 1167	Nucleic Acids Res
372 | 930	J Med Chem
373 | 620	Bioinformatics
374 | 576	J Chem Inf Comput Sci
375 | 425	J Comput Aided Mol Des
376 | 381	BMC Bioinformatics
377 | 351	Drug Discov Today
378 | 252	Mol Inform
379 | 247	J Comput Chem
380 | 236	PLoS One
381 | 222	Nature
382 | 215	Nat Rev Drug Discov
383 | 202	Proc Natl Acad Sci U S A
384 | 181	Science
385 | 148	Anal Chem
386 | 147	Proteins
387 | 140	J Mol Graph Model
388 | ...
389 | ...
390 | ```
391 | _tested on 2021.01.27, EDirect 14.4._
392 | 
393 | Note that the citation formats are also inconsistent here, so the results would need to be evaluated and cleaned up for a more thorough analysis. For example, some of the extracted Citations include author names and article titles, so the `cut` command, which deletes everything after the first `.`, does not isolate the journal abbreviation for those entries.
394 | 
395 | 
396 | ### Number of Records in PubMed by Create Date
397 | 
398 | Here is an interesting script to retrieve the count of PubMed records by create date (`[CRDT]`) for each of the first six months of 2020. Since there are over 100,000 records added to PubMed every month, a strategy using `efetch` likely would not work well (i.e., trying to retrieve 500,000+ records would take a long time). Instead, we extract the Count element directly from the `esearch` result.
399 | 
400 | ```console
401 | 
402 | user@computer:~$ for date in \
403 | >     "2020/01" \
404 | >     "2020/02" \
405 | >     "2020/03" \
406 | >     "2020/04" \
407 | >     "2020/05" \
408 | >     "2020/06" 
409 | > do
410 | >     esearch -email name@xx.edu -db pubmed -query "$date[CRDT]" |
411 | >     xtract -pattern ENTREZ_DIRECT -lbl "$date" -element Count
412 | >     sleep 1 
413 | > done
414 | 2020/01	108863
415 | 2020/02	107561
416 | 2020/03	106386
417 | 2020/04	124575
418 | 2020/05	121324
419 | 2020/06	124664
420 | ```
421 | _tested on 2021.01.27, EDirect 14.4._
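
To cover a full year without typing each month, the list of dates could be generated. This is a minimal sketch in plain Bash; `month_queries` is a hypothetical helper name and not part of EDirect.

```shell
#!/usr/bin/env bash
# month_queries: hypothetical helper, not part of EDirect.
# Prints one "YYYY/MM" string per month of the given year.
month_queries() {
    local year="$1" month
    for month in 1 2 3 4 5 6 7 8 9 10 11 12; do
        printf '%s/%02d\n' "$year" "$month"
    done
}

# Each printed value has the same form as "$date" in the loop above.
month_queries 2020
```

The loop above could then begin with `for date in $(month_queries 2020)` instead of listing the months by hand, keeping the `sleep 1` between requests.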
422 | 
423 | 
424 | 
425 | ### Number of Records in PubMed that are Also Freely Available in PubMed Central
426 | 
427 | Let's say we wanted to know how many articles in *J Chem Inf Model* (indexed in PubMed) are freely available in PubMed Central. We can first get a count of *J Chem Inf Model* records in PubMed by publication year: we query PubMed in the Journal field (`[JOUR]`), retrieve the document summaries, extract the PubDate, and then count the records for each year:
428 | 
429 | ```console
430 | 
431 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Chem Inf Model[JOUR]" | \
432 | > efetch -format docsum | \
433 | > xtract -pattern DocumentSummary -element PubDate | \
434 | > cut -d " " -f 1 | \
435 | > sort-uniq-count-rank | \
436 | > sort -k2,2
437 | 216	2005
438 | 280	2006
439 | 246	2007
440 | 225	2008
441 | 268	2009
442 | 203	2010
443 | 297	2011
444 | 306	2012
445 | 300	2013
446 | 309	2014
447 | 247	2015
448 | 232	2016
449 | 283	2017
450 | 237	2018
451 | 490	2019
452 | 612	2020
453 | 65	2021
454 | ```
455 | _tested on 2021.01.27, EDirect 14.4._
456 | 
457 | 
458 | In the above script, the line `cut -d " " -f 1` deletes any data appearing after the year and `sort -k2,2` sorts the data by the second column. Next, we can add `elink` into our script to find the linked records in PubMed Central (`pmc`) from the Entrez link `pubmed_pmc`:
459 | 
460 | ```console
461 | 
462 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Chem Inf Model[JOUR]" | \
463 | > elink -target pmc -name pubmed_pmc | \
464 | > efetch -format docsum | \
465 | > xtract -pattern DocumentSummary -element PubDate | \
466 | > cut -d " " -f 1 | \
467 | > sort-uniq-count-rank | \
468 | > sort -k2,2
469 | 1	2005
470 | 3	2006
471 | 7	2007
472 | 14	2008
473 | 31	2009
474 | 26	2010
475 | 62	2011
476 | 38	2012
477 | 55	2013
478 | 59	2014
479 | 38	2015
480 | 32	2016
481 | 39	2017
482 | 43	2018
483 | 57	2019
484 | 51	2020
485 | ```
486 | 
487 | _tested on 2021.01.27, EDirect 14.4._
488 | 
489 | Note that if you have a query returning tens of thousands of results, you would likely want to use a strategy without `efetch`, such as adding a date into your `esearch` query, followed by extracting out the count element from the XML.
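
The two yearly tables above can be combined to estimate the share of records that are also in PubMed Central. The following is a minimal sketch, assuming each table has been saved as tab-separated `count<TAB>year` rows; `pmc_share` is a hypothetical helper name, and it further assumes that the year extracted from the PubMed Central summaries is comparable to the PubMed year.

```shell
#!/usr/bin/env bash
# pmc_share: hypothetical helper, not part of EDirect.
# Arguments: two files of "count<TAB>year" rows (PubMed counts, then PMC counts).
# Prints year, PubMed count, PMC count, and the PMC share as a percentage.
pmc_share() {
    awk -F '\t' '
        NR == FNR { total[$2] = $1; next }      # first file: PubMed counts by year
        ($2 in total) && total[$2] > 0 {        # second file: PMC counts by year
            printf "%s\t%d\t%d\t%.1f\n", $2, total[$2], $1, 100 * $1 / total[$2]
        }
    ' "$1" "$2"
}

# Three rows copied from each of the two outputs above.
pubmed_counts="$(mktemp)"; pmc_counts="$(mktemp)"
printf '216\t2005\n297\t2011\n612\t2020\n' > "$pubmed_counts"
printf '1\t2005\n62\t2011\n51\t2020\n' > "$pmc_counts"
pmc_share "$pubmed_counts" "$pmc_counts"
rm -f "$pubmed_counts" "$pmc_counts"
```

Years that appear only in the PubMed table (such as 2021 above) are omitted by this sketch rather than reported as zero.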
490 | 
491 | 
492 | 


--------------------------------------------------------------------------------
/03_EDirect_PubChem_BioAssay_PubMed_Recipes.md:
--------------------------------------------------------------------------------
  1 | # PubChem <--> PubChem BioAssay <--> PubMed EDirect Recipes
  2 | 
  3 | **Notes**
  4 | 
  5 | > 1. `user@computer:~$` represents an example terminal prompt name. Actual command/argument input is after the `$`.
  6 | > 2. Replace `name@xx.edu` with your email address.
  7 | > 3. `\` followed by `>` on the next line represents continued terminal input. You will need to delete the `>` symbol in order to run the scripts as a copy/paste into terminal.
  8 | > 4. You should validate your own EDirect scripts and results as there may be unintentional mistakes in these recipes. A convenient method is to compare your EDirect results to the NCBI Web interface search results: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/).
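
The manual clean-up described in note 3 could also be scripted. This is a minimal sketch, assuming a recipe is copied exactly as displayed here (the `user@computer:~$ ` prompt on the first line and `> ` at the start of continuation lines); `strip_prompts` is a hypothetical helper name, not an EDirect command.

```shell
#!/usr/bin/env bash
# strip_prompts: hypothetical helper, not part of EDirect.
# Reads a copied recipe on stdin and removes the example prompt
# ("user@computer:~$ ") and the leading "> " continuation markers.
strip_prompts() {
    sed -E -e 's/^user@computer:~\$ //' -e 's/^> ?//'
}

# Example: clean a two-line recipe written in the style of this page.
printf '%s\n' \
    'user@computer:~$ esearch -db pubmed -query "J Cheminform[JOUR]" | \' \
    '> efetch -format docsum' | strip_prompts
```

Only the command lines of a recipe should be passed through the helper, not the result rows that follow them.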
  9 | 
 10 | ## EDirect PubChem Entrez Links
 11 | 
 12 | ### PubChem Compound --> PubMed Citations
 13 | **Description:** Search for a CID in the PubChem Compound Database and retrieve related PubMed linked references.
 14 | 
 15 | In the below script, we use the `esearch` function to query the PubChem Compound database (`pccompound`) for CID 174076 within the Compound ID field, `[uid]`. The `esearch` results are then piped to `elink` finding related PubMed citations via the Entrez link `pccompound_pubmed`. Finally, we retrieve the results with `efetch` in XML format and extract out some bibliographic reference information using the `xtract` function.
 16 | 
 17 | ```console
 18 | 
 19 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 174076[uid] | \
 20 | > elink -target pubmed -name pccompound_pubmed | \
 21 | > efetch -format xml | \
 22 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
 23 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn
 24 | 22957575	Gabl	S	J Chem Phys	2012	137	9	094501
 25 | 22868451	Zhang	Y	Phys Chem Chem Phys	2012	14	35	12157-64
 26 | 22859056	Malberg	F	Phys Chem Chem Phys	2012	14	35	12079-82
 27 | 22852554	Zhang	Y	J Phys Chem B	2012	116	33	10036-48
 28 | 22662183	Zhang	BB	PLoS ONE	2012	7	5	e37641
 29 | ...
 30 | ```
 31 | _tested on 2021.01.26, EDirect 14.4, total count was 102._
 32 | 
 33 | ### PubChem Compound --> PubMed Citations (with filtering)
 34 | **Description:** Search for a CID in the PubChem Compound Database, find related PubMed citations, then only retrieve references from a specific journal.
 35 | 
 36 | We can filter `elink` results with `efilter` so that only PubMed citations that are linked to the CID (via the Entrez link `pccompound_pubmed`) and that also match a specific PubMed query are retained. For example, if we are only interested in _Phys Chem Chem Phys_ references linked to CID 174076, we can use the journal field `[JOUR]` in an `efilter` query:
 37 | 
 38 | ```console
 39 | 
 40 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 174076[uid] | \
 41 | > elink -target pubmed -name pccompound_pubmed | \
 42 | > efilter -query "Phys Chem Chem Phys"[JOUR] | \
 43 | > efetch -format xml | \
 44 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
 45 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn
 46 | 22868451	Zhang	Y	Phys Chem Chem Phys	2012	14	35	12157-64
 47 | 22859056	Malberg	F	Phys Chem Chem Phys	2012	14	35	12079-82
 48 | 22451012	Sillars	FB	Phys Chem Chem Phys	2012	14	17	6094-100
 49 | 21643581	Pensado	AS	Phys Chem Chem Phys	2011	13	30	13518-26
 50 | 21643580	Schröder	C	Phys Chem Chem Phys	2011	13	26	12240-8
 51 | ...
 52 | ```
 53 | _tested on 2021.01.26, EDirect 14.4, total count was 11._
 54 | 
 55 | ### PubChem Compound --> PubMed MeSH (with filtering)
 56 | **Description:** Search for a CID in PubChem Compound, find related PubMed records via MeSH, and retrieve only references that contain the MeSH subheading "chemical synthesis".
 57 | 
 58 | This is my favorite literature search: start with a PubChem CID and then find PubMed literature related to its synthesis. Similarly to the search above, we can narrow the references using an `efilter` query for 'chemical synthesis' as a MeSH subheading (`[SUBH]`). Note that we used the `pccompound_pubmed_mesh` Entrez link as the `elink` link name (`-name`) here; the target database is still `pubmed`.
 59 | 
 60 | ```console
 61 | 
 62 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 94257[uid] | \
 63 | > elink -target pubmed -name pccompound_pubmed_mesh | \
 64 | > efilter -query "chemical synthesis"[SUBH] | \
 65 | > efetch -format xml | \
 66 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID ArticleTitle \
 67 | > ISOAbbreviation PubDate/Year
 68 | 28463562	Enantioselective Chemical Syntheses of the Furanosteroids (-)-Viridin and (-)-Viridiol.	J. Am. Chem. Soc.	2017
 69 | 23040731	Viridin analogs derived from steroidal building blocks.	Bioorg. Med. Chem. Lett.	2012
 70 | 22849426	Synthetic studies on furanosteroids: construction of the viridin core structure via Diels-Alder/retro-Diels-Alder and vinylogous Mukaiyama aldol-type reaction.	J. Org. Chem.	2012
 71 | 19644878	Abrogation of antibody-induced arthritis in mice by a self-activating viridin prodrug and association with impaired neutrophil and endothelial cell function.	Arthritis Rheum.	2009
 72 | 19572524	Pentacyclic furanosteroids: the synthesis of potential kinase inhibitors related to viridin and wortmannolone.	J. Org. Chem.	2009
 73 | ...
 74 | ```
 75 | _tested on 2021.01.26, EDirect 14.4, total count was 8._
 76 | 
 77 | 
 78 | ### PubChem Compound --> PubMed Citations OR PubMed MeSH
 79 | **Description:** Search for a CID in PubChem Compound, find related PubMed citations and related PubMed citations via MeSH.
 80 | 
 81 | It appears that you can combine `elink` queries, with either the same Entrez link or a different Entrez link, but within the same database. For example, if we want to retrieve PubMed literature related to PubChem CID 174076 for both the `pccompound_pubmed` and `pccompound_pubmed_mesh` Entrez links in one dataset, we combine two separate `elink` queries with an OR operator:
 82 | 
 83 | ```console
 84 | 
 85 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 174076[uid] | \
 86 | > elink -target pubmed -name pccompound_pubmed -label pubmed_cit | \
 87 | > esearch -email name@xx.edu -db pccompound -query 174076[uid] | \
 88 | > elink -target pubmed -name pccompound_pubmed_mesh -label pubmed_mesh_cit | \
 89 | > esearch -query "(#pubmed_cit) OR (#pubmed_mesh_cit)" | \
 90 | > efetch -format xml | \
 91 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
 92 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn
 93 | 32231037	Babicka	M	Molecules	2020	25	7
 94 | 31931064	Love	SA	Int J Biol Macromol	2020	147	569-575
 95 | 31818016	Wang	F	Int J Mol Sci	2019	20	24
 96 | 31814059	Weber	AL	Orig Life Evol Biosph	2019	49	4	199-211
 97 | 31675504	Gomez-Herrero	E	Ecotoxicol Environ Saf	2020	187	109836
 98 | 31520950	Pal	S	Ecotoxicol Environ Saf	2019	184	109634
 99 | ...
100 | ```
101 | _tested on 2021.01.26, EDirect 14.4, total count was 317._
102 | 
103 | 
104 | ### PubChem Substance --> PubChem Compound --> PubMed Publisher
105 | **Description:** Search for a PubChem Substance Data Source Depositor, find the matching PubChem Compounds (via `pcsubstance_pccompound_same`), and then retrieve related PubMed references linked via publisher.
106 | 
107 | In the below script, we first search the PubChem Substance (`pcsubstance`) database using `esearch` for the data source depositor _Nature Communications_. We can use the Current Source Name `[CSN]` field for this query. Note that an underscore is put in place of the space in the query. This syntax is important for searching in PubChem with the EDirect `esearch` function. After `esearch`, we pipe the results into `elink` twice, first finding related PubChem Compounds via the `pcsubstance_pccompound_same` Entrez link, and then using this new result list to find related PubMed publisher deposited citations from the `pccompound_pubmed_publisher` Entrez link. Finally, similarly to previous searches, we use a combination of `efetch` and `xtract` to retrieve selected data:
108 | 
109 | ```console
110 | 
111 | user@computer:~$ esearch -email name@xx.edu -db pcsubstance -query "nature_communications"[CSN] | \
112 | > elink -target pccompound -name pcsubstance_pccompound_same | \
113 | > elink -target pubmed -name pccompound_pubmed_publisher | \
114 | > efetch -format xml | \
115 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName Author/Initials \
116 | > ISOAbbreviation PubDate/Year Volume Issue MedlinePgn
117 | 26673265	Gilbert	ZW	Nat Chem	2016	8	1	63-8
118 | 25424885	Yan	T	Nat Commun	2014	5	5602
119 | 25422853	Vaidya	AB	Nat Commun	2014	5	5521
120 | 25382411	Dommerholt	J	Nat Commun	2014	5	5378
121 | 25382259	Wang	B	Nat Commun	2014	5	5354
122 | ...
123 | ```
124 | 
125 | _tested on 2021.01.26, EDirect 14.4, total count was 101._
126 | 
127 | 
128 | ### PubChem Substance --> PubChem Compound <--> PubMed Publisher
129 | **Description:** Search for a PubChem Substance Data Source Depositor, find the matching PubChem Compounds (via `pcsubstance_pccompound_same`), and then retrieve related PubMed PMIDs linked via publisher.
130 | 
131 | Building upon the previous search, if needed, it is possible to obtain individual relationships of the CIDs to PubMed IDs (CID <--> PMID). We can do this using the `-cmd neighbor` option in `elink`:
132 | 
133 | ```console
134 | 
135 | user@computer:~$ esearch -email name@xx.edu -db pcsubstance -query "nature_communications"[CSN] | \
136 | > elink -target pccompound -name pcsubstance_pccompound_same | \
137 | > elink -target pubmed -name pccompound_pubmed_publisher -cmd neighbor | \
138 | > xtract -pattern LinkSet -element Id
139 | 146033657	24398593
140 | 136286496
141 | 136264969	24177669
142 | 136264968	24177669
143 | 136262920	23385592
144 | 136262919	23385592
145 | 136247006	24457545
146 | 136247005	24457545
147 | 136247004	24457545
148 | 136247003	24457545
149 | 136219971
150 | 135922679	22027590
151 | 91868204	23764831
152 | ...
153 | ```
154 | _tested on 2021.01.26, EDirect 14.4, total count was 1594 (returns all CIDs, not all have linked PMIDs)._
155 | 
156 | The first column contains the PubChem CIDs and the second column contains the linked PMIDs. Additional linked PMIDs are placed in subsequent columns when available.
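
For downstream analysis it can be convenient to have one CID and PMID pair per line. The following is a minimal sketch, assuming the tab-separated layout shown above; `to_pairs` is a hypothetical helper name, not an EDirect command.

```shell
#!/usr/bin/env bash
# to_pairs: hypothetical helper, not part of EDirect.
# Turns "ID<TAB>linked1<TAB>linked2..." rows into one "ID<TAB>linked" row
# per link; rows with no linked IDs produce no output.
to_pairs() {
    awk -F '\t' -v OFS='\t' '{ for (i = 2; i <= NF; i++) print $1, $i }'
}

# Rows copied from the output above (CID, then linked PMIDs if any).
printf '146033657\t24398593\n136286496\n136264969\t24177669\n' | to_pairs
```

The function would be appended to the pipeline after `xtract -pattern LinkSet -element Id`. CIDs without linked PMIDs disappear from the result, so this form suits analyses of the links rather than of the full CID list.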
157 | 
158 | 
159 | ### PubChem Compound --> PubChem BioAssay
160 | **Description:** Search for a PubChem CID in PubChem Compound, then retrieve related PubChem active BioAssay data.
161 | 
162 | To retrieve BioAssay results labeled as 'Active' that are linked to a CID, we can use the `elink` function with the PubChem BioAssay (`pcassay`) database via Entrez link `pccompound_pcassay_active`. This is followed by `efetch` and `xtract`. In this particular example, we extracted the AID, CurrentSourceName, AssayName, ActiveSidCount, and TargetCount:
163 | 
164 | ```console
165 | 
166 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "6303"[uid] | \
167 | > elink -target pcassay -name pccompound_pcassay_active | \
168 | > efetch -format docsum | \
169 | > xtract -pattern DocumentSummary -element AID CurrentSourceName AssayName ActiveSidCount TargetCount
170 | 1255098	ChEMBL	Inhibition of TLR4-mediated NF-kappaB signaling pathway in BALB/c mouse RAW264.7 cells assessed as suppression of LPS-stimulated PGE2 production at 1 to 10 ug/ml preincubated for 1 hr followed by LPS challenge measured after 6 hrs by immunoblot analysis	1	1
171 | 1255092	ChEMBL	Inhibition of TLR4-mediated NF-kappaB signaling pathway in BALB/c mouse RAW264.7 cells assessed as suppression of LPS-stimulated TNF-alpha production at 1 to 10 ug/ml preincubated for 1 hr followed by LPS challenge measured after 6 hrs by immunoblot analysis	1	1
172 | 751324	ChEMBL	Inhibition of NFkappaB p65 nuclear translocation in mouse RAW264.7 cells after 24 hrs by DAPI staining-based laser confocal immunofluorescent microscopic analysis	1	1
173 | 
174 | ...
175 | ```
176 | _tested on 2021.01.26, EDirect 14.4, total count was 47._
177 | 
178 | 
179 | ### PubChem Compound <--> PubChem BioAssay
180 | **Description:** Search for a PubChem CID in PubChem Compound Database, find related compounds with the same connectivity, then retrieve related AIDs for each CID.
181 | 
182 | It is possible to obtain individual relationships of the CIDs to BioAssay AIDs (CID <--> AID). We can do this using the `-cmd neighbor` option in `elink`. Note that we first found related compounds with the same connectivity using the Entrez link `pccompound_pccompound_sameconnectivity_pulldown`. This step was followed by the `pccompound_pcassay_active` Entrez link to the PubChem BioAssay database to retrieve AID links to the CIDs. We used the 'Active' assay links here. There are also other Entrez PubChem Compound assay links such as inactive, `pccompound_pcassay_inactive`.
183 | 
184 | ```console
185 | 
186 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "6303"[uid] | \
187 | > elink -target pccompound -name pccompound_pccompound_sameconnectivity_pulldown | \
188 | > elink -target pcassay -name pccompound_pcassay_active -cmd neighbor | \
189 | > xtract -pattern LinkSet -element Id
190 | ...
191 | ...
192 | 6335098
193 | 688425
194 | 451875	1347103	1296009	2551	2546
195 | 248010	687016	652245	651719	177	175	173	171	169	167	165	163	161	159	157	155	153	149	147
196 | 6303	1346987	1259407	1255098	1255092	1207585	1207584	1207579	1207578	1207577	1207576	1167619	1159565	1159562	1159559	1159557	1065715	1065714	1065710	1065706	1065697	1065696	1065695	1065713	1065705	1065699	751324	686979	686978	651820	652245	651719	602346	602250	588511	493002	463218	463212	416870	416743	216185	86858	81069	32353	32352	31719	31718	2467
197 | ```
198 | _tested on 2021.01.26, EDirect 14.4, total count was 24 CIDs (not all have associated AIDs)._
199 | 
200 | 
201 | ## EDirect PubMed Entrez Links
202 | 
203 | ### PubMed --> PubChem Compound
204 | **Description:** Search for a PubMed article ID (PMID), then retrieve related PubChem Compounds.
205 | 
206 | In the below script, we first use `esearch` to query PubMed for the article ID 29407984 in the `[PMID]` field. This result is then piped into `elink` to retrieve linked compounds in the PubChem Compound database (`pubmed_pccompound`). In this case, there was one compound and we used `efetch` to retrieve the CID record as docsum XML, followed by `xtract` to extract out the IsomericSmiles, CID, and InChIKey values. 
207 | 
208 | ```console
209 | 
210 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "29407984"[PMID] | \
211 | > elink -target pccompound -name pubmed_pccompound | \
212 | > efetch -format docsum | \
213 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey
214 | C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)O	2764	MYSWGUAQZAJSOK-UHFFFAOYSA-N
215 | 
216 | ```
217 | _tested on 2021.01.26, EDirect 14.4, total count was 1._
218 | 
219 | 
220 | ### PubMed --> PubChem Compound (+ mixtures)
221 | **Description:** Search for a PubMed article ID (PMID), then retrieve linked PubChem Compound mixtures/components.
222 | 
223 | In this script, an additional `elink` search is added to find related PubChem Mixture/Component compounds via Entrez link `pccompound_pccompound_mixture`.
224 | 
225 | ```console
226 | 
227 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "29407984"[PMID] | \
228 | > elink -target pccompound -name pubmed_pccompound | \
229 | > elink -target pccompound -name pccompound_pccompound_mixture | \
230 | > efetch -format docsum | xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey
231 | ...
232 | ...
233 | C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)O.C(CO)N(CCO)CCO	154963193	NGBBVVPJSFAHHI-UHFFFAOYSA-N
234 | C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)O.C(=O)(C(=O)O)O.[Na]	154963186	HNZYVRRDOWOQIT-UHFFFAOYSA-N
235 | CNC.C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)O	154963184	NBGZCMHVBXSHSN-UHFFFAOYSA-N
236 | C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)[OH2+]	153275427	MYSWGUAQZAJSOK-UHFFFAOYSA-O
237 | C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)[O-]	152748405	MYSWGUAQZAJSOK-UHFFFAOYSA-M
238 | ...
239 | ...
240 | ```
241 | _tested on 2021.01.26, EDirect 14.4, total count was 375._
242 | 
243 | 
244 | 
245 | ### PubMed --> PubChem Compound (MESH search)
246 | **Description:** Search PubMed with a text query, then retrieve linked PubChem Compounds.
247 | 
248 | We can also perform text queries in PubMed and retrieve linked PubChem Compounds. Note that in the below script we searched for "ionic liquids" in the `[MESH]` field and imidazolium in any field. Since this query requires two pairs of quotes, we have to escape the internal quotes with backslashes in order for the query to be interpreted correctly. The Entrez link `pubmed_pccompound` was used to find related PubChem compounds.
249 | 
250 | ```console
251 | 
252 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "\"ionic liquids\"[MESH] AND imidazolium" | \
253 | > elink -target pccompound -name pubmed_pccompound | \
254 | > efetch -format docsum | \
255 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey
256 | C1=CC2=CC(=C(C(=C2C(=O)C(=C1)O)O)O)O	135403797	WDGFFVCWBZVLCE-UHFFFAOYSA-N
257 | C1=NC2=C(N1[C@H]3[C@@H]([C@@H]([C@H](O3)CO)O)O)N=C(NC2=O)N	135398635	NYHBQMYGNKIUIF-UUOKFMHZSA-N
258 | CCCCCCCCN1C=C[N+](=C1C2=[N+](C=CN2CCCC)C)C	123995430	DRJFJBHYMOHPHX-UHFFFAOYSA-N
259 | CC(=O)OC1=[N+](C=CN1CC=C)C	123614562	XSXMFLUARQMOLS-UHFFFAOYSA-N
260 | C[N+]1=C(N(C=C1)CCCCCCCCCCCCS)OC(=O)OC2=[N+](C=CN2CCCCCCCCCCCCS)C	123431445	DRJOSAVMFCYCSU-UHFFFAOYSA-P
261 | ...
262 | ```
263 | _tested on 2021.01.27, EDirect 14.4, total count was 395._
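
The backslash-escaped quotes in the query above can be hard to read. As a small sketch of an alternative (plain Bash quoting, nothing EDirect-specific), wrapping the whole query in single quotes yields the same string; the variable names are only for illustration.

```shell
#!/usr/bin/env bash
# Two ways to write a PubMed query that itself contains double quotes.
query_escaped="\"ionic liquids\"[MESH] AND imidazolium"   # escaped inner quotes
query_single='"ionic liquids"[MESH] AND imidazolium'      # single-quoted string

# Both variables hold exactly the same text.
if [ "$query_escaped" = "$query_single" ]; then
    printf 'identical: %s\n' "$query_single"
fi
```

Either variable could then be passed as `-query "$query_single"`. Single quotes prevent variable expansion, so the escaped form is still needed when the query contains a shell variable.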
264 | 
265 | ### PubMed --> PubChem Compound (MESH search, and a PubChem filter)
266 | **Description:** Search PubMed with a text query and retrieve only linked compounds containing defined chiral atoms.
267 | 
268 | We can also perform some powerful filtering with `efilter`. In the below script, the `[ACDC]` field is the defined atom chiral count in PubChem. A range of 1 through 100 was added for this `[ACDC]` filter. Since it is unlikely that any of the compounds would have near 100 chiral atoms, we can be fairly confident this should capture most, if not all, cases in our search.
269 | 
270 | ```console
271 | 
272 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "\"ionic liquids\"[MESH] AND imidazolium" | \
273 | > elink -target pccompound -name pubmed_pccompound | \
274 | > efilter -query "1:100"[ACDC] | \
275 | > efetch -format docsum | \
276 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey
277 | C1=NC2=C(N1[C@H]3[C@@H]([C@@H]([C@H](O3)CO)O)O)N=C(NC2=O)N	135398635	NYHBQMYGNKIUIF-UUOKFMHZSA-N
278 | C([C@H](C(=O)[C@H](CO)O)O)O	54067296	WXYXERHRDKEISL-ZXZARUISSA-N
279 | B(O)(O)OCC(=O)[C@H]([C@@H]([C@@H](CO)O)O)O	53705729	BXAZSOZNRVUIGN-UYFOZJQFSA-N
280 | C([C@H]1[C@@H]([C@H]([C@@H]([C@@H](O1)O[C@@H]2[C@@H](O[C@H]([C@@H]([C@H]2O)O)O)CO)O)O)O)O	46936190	GUBGYTABKSRVRQ-AEDSEYDFSA-N
281 | C1[C@H](OC2=CC(=CC(=C2C1=O)O)OC3C(C(C(C(O3)CO)O)O)O)C4=CC=C(C=C4)O	42607902	DLIKSSGEMUFQOK-CEFFZDIVSA-N
282 | ...
283 | ```
284 | _tested on 2021.01.27, EDirect 14.4, total count was 43._
285 | 
286 | 
287 | ### PubMed --> PubChem Compounds + PubChem Compounds (MeSH) + PubChem Compounds (Publisher)
288 | **Description:** Search PubMed, then find linked PubChem Compounds, PubChem Compounds via PubMed MeSH, and PubChem Compound PubMed Publisher.
289 | 
290 | As seen in the previous PubChem searches, there are several Entrez links from PubMed to PubChem Compound such as `pubmed_pccompound`, `pubmed_pccompound_mesh`, and `pubmed_pccompound_publisher`. We can retrieve associated compounds from all three at the same time like this:
291 | 
292 | ```console
293 | 
294 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "imidazolium AND bacteria" | \
295 | > elink -target pccompound -name pubmed_pccompound -label compounds_01 | \
296 | > esearch -email name@xx.edu -db pubmed -query "imidazolium AND bacteria" | \
297 | > elink -target pccompound -name pubmed_pccompound_mesh -label compounds_02 | \
298 | > esearch -email name@xx.edu -db pubmed -query "imidazolium AND bacteria" | \
299 | > elink -target pccompound -name pubmed_pccompound_publisher -label compounds_03 | \
300 | > esearch -query "(#compounds_01) OR (#compounds_02) OR (#compounds_03)" | \
301 | > efetch -format docsum | \
302 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey
303 | C[Si](C)(O)O[Si](C)(C)O.C[Si](CCC1=CC=CC=C1)(O)O[Si](C)(CCC2=CC=CC=C2)O	155288862	HYTMPCHVCUAIOW-UHFFFAOYSA-N
304 | CC(C1CCC(C(O1)OC2C(CC(C(C2O)OC3C(C([C@@](CO3)(C)O)NC)O)N)N)N)NC	146157093	CEAZRRDELHUEMR-NWNXOGAHSA-N
305 | C1=CC=C2C(=C1)C=C(N2)CC3=CC=C(C=C3)C(F)(F)F.C1=CC=C2C(=C1)C=C(N2)CC3=CC=C(C=C3)C(F)(F)F	139191468	NNXWVRXROOQQNO-UHFFFAOYSA-N
306 | CC[C@@H]1[C@@]2([C@@H]([C@H](C(=O)[C@@H](C[C@@]([C@@H]([C@H](C(=O)[C@H](C(=O)O1)C)C)O[C@@H]3[C@@H]([C@H](C[C@H](O3)C)N(C)C)O)(C)OC)C)C)N(C(=O)O2)CCCCN4C=C(N=C4)C5=CN=CC=C5)C	138402871	LJVAJPDWBABPEJ-WMGYHEQLSA-N
307 | CCOC(=O)/C(=N\NC1=CC=CC2=C1N=CC=C2)/C3=[N+](C=CN3)C	136199795	ZHOWQGCGVQGEKD-UHFFFAOYSA-O
308 | ...
309 | ...
310 | ```
311 | _tested on 2021.01.27, EDirect 14.4, total count was 568._
312 | 
313 | 
314 | ### PubMed <--> PubChem Compound
315 | **Description:** Search PubMed for an affiliation, find related PubChem Compounds, then retrieve related CIDs for each PMID.
316 | 
317 | If we want to retrieve the PMID <--> CID relationships (for Entrez link `pubmed_pccompound`), we can achieve this using the `-cmd neighbor` option in `elink`:
318 | 
319 | ```console
320 | 
321 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "(university of alabama[AFFL]) \
322 | > NOT (birmingham[AFFL] OR huntsville[AFFL])" \
323 | > -datetype PDAT -mindate 2010 -maxdate 2020 | \
324 | > elink -target pccompound -name pubmed_pccompound -cmd neighbor | \
325 | > xtract -pattern LinkSet -element Id
326 | ...
327 | ...
328 | 21800250
329 | 21783326	18679079	8914	942	702
330 | 21782896
331 | 21756136
332 | 21728552
333 | 21718269
334 | 21711000	561577	169577	166929	166928	164636
335 | 21702462
336 | 21693669
337 | 21692575
338 | ...
339 | ...
340 | ```
341 | 
342 | _tested on 2021.01.27, EDirect 14.4, total count was 3639 (returns all PMIDs, not all have linked CIDs)._
343 | 
344 | The first column contains the PMIDs, and the second and subsequent columns contain the linked PubChem CIDs (from the `pubmed_pccompound` links). As an aside, the PubMed query for "university of alabama" in the affiliation field (`[AFFL]`) excludes (NOT operator) any results containing huntsville or birmingham in the affiliation. This excludes references from University of Alabama at Birmingham and University of Alabama at Huntsville (including collaborative references with the Tuscaloosa campus).
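
Because every PMID is returned whether or not it has links, a quick per-row tally can show which records actually have linked compounds. This is a minimal sketch, assuming the tab-separated layout shown above; `count_links` is a hypothetical helper name, not an EDirect command.

```shell
#!/usr/bin/env bash
# count_links: hypothetical helper, not part of EDirect.
# For each "ID<TAB>linked1<TAB>linked2..." row, prints the ID and the
# number of linked IDs (0 when the row holds only the ID).
count_links() {
    awk -F '\t' -v OFS='\t' 'NF > 0 { print $1, NF - 1 }'
}

# Rows copied from the output above (PMID, then linked CIDs if any).
printf '21800250\n21783326\t18679079\t8914\t942\t702\n' | count_links
```

The same tally applies to any of the `-cmd neighbor` outputs in these recipes, since they share the layout of one source ID followed by its linked IDs.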
345 | 
346 | 
347 | ### PubMed --> PubChem BioAssay
348 | **Description:** Search PubMed for an article, find related PubChem BioAssays, then retrieve some BioAssay data.
349 | 
350 | ```console
351 | 
352 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "32459468"[PMID] | \
353 | > elink -target pcassay -name pubmed_pcassay | \
354 | > efetch -format docsum | \
355 | > xtract -pattern DocumentSummary -element AID CurrentSourceName AssayName ActiveSidCount TargetCount
356 | 1347414	National Center for Advancing Translational Sciences (NCATS)	qHTS to identify inhibitors of the type 1 interferon - major histocompatibility complex class I in skeletal muscle: Secondary screen by immunofluorescence	0	1
357 | 1347412	National Center for Advancing Translational Sciences (NCATS)	qHTS assay to identify inhibitors of the type 1 interferon - major histocompatibility complex class I in skeletal muscle: Counter screen cell viability and HiBit confirmation	0	1
358 | 1347415	National Center for Advancing Translational Sciences (NCATS)	qHTS to identify inhibitors of the type 1 interferon - major histocompatibility complex class I in skeletal muscle: tertiary screen by RT-qPCR	34	1
359 | 1347413	National Center for Advancing Translational Sciences (NCATS)	qHTS to identify inhibitors of the type 1 interferon - major histocompatibility complex class I in skeletal muscle: tertiary screen by RT-qPCR, retest select compounds	3	1
360 | ...
361 | ```
362 | _tested on 2021.01.27, EDirect 14.4, total count was 7._
363 | 
364 | ### PubMed <--> PubChem BioAssay
365 | **Description:** Search PubMed for an article, find the articles that cite it, then retrieve related PubChem BioAssays.
366 | 
367 | If we want to retrieve the PMID <--> AID relationships (for Entrez link `pubmed_pcassay`), we can achieve this using the `-cmd neighbor` option in `elink`. Note that here we queried PubMed for an article, then found the articles that cite it with `elink -cited`, before piping these results into the Entrez link `pubmed_pcassay`.
368 | 
369 | 
370 | ```console
371 | 
372 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "17876319"[PMID] | \
373 | > elink -cited | \
374 | > elink -target pcassay -name pubmed_pcassay -cmd neighbor | \
375 | > xtract -pattern LinkSet -element Id
376 | ...
377 | ...
378 | 21167154
379 | 21164511
380 | 21159777
381 | 21138309	568760	568754	568753	568763	568762	568761	568759	568758	568757	568756	568755
382 | 21131971
383 | 21129186
384 | ...
385 | ...
386 | ```
387 | _tested on 2021.01.27, EDirect 14.4, total count was 366 (returns all PMIDs, not all have linked AIDs)._
388 | 
389 | In the above output, the first column contains the PMIDs and subsequent columns contain the linked BioAssays (AIDs).
390 | 
391 | ## EDirect PubChem BioAssay Entrez Links
392 | 
393 | ### PubChem BioAssay --> PubMed
394 | **Description:** Search PubChem BioAssay for assays from a specific source name and then find related PubMed literature.
395 | 
396 | In the below script, we first use `esearch` to query PubChem BioAssay for IUPHAR/BPS_Guide_to_PHARMACOLOGY in the Source Name field (`[SNME]`). This result is then piped into `elink` to retrieve linked records in the PubMed database (`pcassay_pubmed`). The `efilter` function was used to limit the results to publication dates (`PDAT`) from 2015 through 2020. This resulted in 332 records, and we used `efetch` to retrieve the PubMed records as XML, followed by `xtract` to extract some bibliographic information.
397 | 
398 | ```console
399 | 
400 | user@computer:~$ esearch -email name@xx.edu -db pcassay -query "IUPHAR/BPS_Guide_to_PHARMACOLOGY"[SNME] | \
401 | > elink -target pubmed -name pcassay_pubmed | \
402 | > efilter -mindate 2015 -maxdate 2020 -datetype PDAT | \
403 | > efetch -format xml | \
404 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
405 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn
406 | 29722898	Fu	R	Br. J. Pharmacol.	2018	175	14	3034-3049
407 | 29688582	Kato	M	Br J Clin Pharmacol	2018	84	8	1821-1829
408 | 29683659	Pike	KG	J. Med. Chem.	2018	61	9	3823-3841
409 | 29674331	Kawaharada	S	J. Pharmacol. Exp. Ther.	2018	366	1	58-65
410 | 29672049	Gucký	T	J. Med. Chem.	2018	61	9	3855-3869
411 | 29620892	Nikolaou	A	J. Med. Chem.	2018	61	8	3697-3711
412 | 29615471	Xu	X	J. Pharmacol. Exp. Ther.	2018	365	3	624-635
413 | 29608575	Taylor Meadows	KR	PLoS ONE	2018	13	4	e0193236
414 | ...
415 | ```
416 | _tested on 2021.01.27, EDirect 14.4, total count was 332._
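
Once the bibliographic table is available, a quick summary such as the number of articles per journal can be produced with standard Unix tools. This is only a sketch: `count_by_journal` is a hypothetical helper, and it assumes the journal abbreviation is in the fourth tab-separated column, as in the `xtract` call above.

```shell
# Count rows per journal (column 4) and list the most frequent journals first.
count_by_journal() {
  awk -F'\t' -v OFS='\t' '{ n[$4]++ } END { for (j in n) print n[j], j }' |
    sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Three sample rows copied from the output above
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  29683659 Pike KG 'J. Med. Chem.' 2018 61 9 3823-3841 \
  29620892 Nikolaou A 'J. Med. Chem.' 2018 61 8 3697-3711 \
  29608575 'Taylor Meadows' KR 'PLoS ONE' 2018 13 4 e0193236 | count_by_journal
```

Note that the same journal can appear under slightly different abbreviations (for example with and without periods), which this simple count treats as separate journals.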
417 | 
418 | ### PubChem BioAssay --> PubChem Compound
419 | **Description:** Search PubChem BioAssay for an assay, find related PubChem Compounds, and retrieve some property data for the compounds.
420 | 
421 | In the below script, we first use `esearch` to query PubChem BioAssay for the assay ID 527855 in the `[UID]` field. This result is then piped into `elink` to retrieve linked compounds in the PubChem Compound database (`pcassay_pccompound`). In this case, there were 16 compounds and we used `efetch` to retrieve the CID records as docsum XML, followed by `xtract` to extract the IsomericSmiles, CID, HydrogenBondDonorCount, HydrogenBondAcceptorCount, MolecularWeight, and XLogP values.
422 | 
423 | ```console
424 | 
425 | user@computer:~$ esearch -email name@xx.edu -db pcassay -query "527855"[UID] | \
426 | > elink -target pccompound -name pcassay_pccompound | \
427 | > efetch -format docsum | \
428 | > xtract -pattern DocumentSummary -element IsomericSmiles CID HydrogenBondDonorCount HydrogenBondAcceptorCount \
429 | > MolecularWeight XLogP
430 | CN(CC1=CC=CC=C1)C(=O)C2=C(NC(=N2)C3=CC=CC=C3)C(=O)O	52949178	2	4	335.400	2.9
431 | C1=CC=C(C=C1)CCN(CC2=CC=CC=C2)C(=O)C3=C(NC(=N3)C4=CC=CC=C4)C(=O)O	52948352	2	4	425.500	4.8
432 | CN(CC(=O)O)C(=O)C1=C(NC(=N1)C2=CC=CC=C2)C(=O)O	52947957	3	6	303.270	1
433 | C1=CC=C(C=C1)CNC(=O)C2=C(NC(=N2)C3=CC=CC=C3)C(=O)O	52946755	3	4	321.300	2.7
434 | CCOC(=O)CN(CC1=CC=CC=C1)C(=O)C2=C(NC(=N2)C3=CC=CC=C3)C(=O)O	52945544	2	6	407.400	3.2
435 | C1=CC=C(C=C1)CN(CC2=CC=CC=C2)C(=O)C3=C(NC(=N3)C4=CC(=CC=C4)Cl)C(=O)O	52944295	2	4	445.900	5
436 | CCNC(=O)C1=C(NC(=N1)C2=CC=CC=C2)C(=O)O	52941818	3	4	259.260	1.6
437 | CNC(=O)C1=C(NC(=N1)C2=CC=CC=C2)C(=O)O	52941817	3	4	245.230	1.2
438 | ...
439 | ```
440 | _tested on 2021.01.27, EDirect 14.4, total count was 16._
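
Because the property values arrive as plain tab-separated columns, they can be filtered numerically without further parsing. The sketch below keeps compounds at or below a chosen MolecularWeight and XLogP; `filter_props` and the cutoff values are hypothetical choices for illustration, and the column positions assume the `-element` order used above.

```shell
# Keep rows whose MolecularWeight (column 5) and XLogP (column 6)
# are both at or below the cutoffs given as the first and second argument.
filter_props() {
  awk -F'\t' -v mw="$1" -v xlogp="$2" '$5 + 0 <= mw + 0 && $6 + 0 <= xlogp + 0'
}

# Two sample rows copied from the output above; only CID 52941817 passes these cutoffs
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  'CNC(=O)C1=C(NC(=N1)C2=CC=CC=C2)C(=O)O' 52941817 3 4 245.230 1.2 \
  'C1=CC=C(C=C1)CN(CC2=CC=CC=C2)C(=O)C3=C(NC(=N3)C4=CC(=CC=C4)Cl)C(=O)O' 52944295 2 4 445.900 5 |
  filter_props 350 3
```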
441 | 
442 | ### PubChem BioAssay <--> PubChem Compound
443 | **Description:** Search PubChem BioAssay for an assay, find related assays based on similar publications, then find related PubChem Compounds.
444 | 
445 | If we want to retrieve the AID <--> CID relationships (for Entrez link `pcassay_pccompound`), we can achieve this using the `-cmd neighbor` option in `elink`. Here we queried PubChem BioAssay for an assay, then found related assays by similar publication list using `elink` (`pcassay_pcassay_similar_publication_list`). This result was then piped into the Entrez link `pcassay_pccompound`.
446 | 
447 | 
448 | ```console
449 | 
450 | user@computer:~$ esearch -email name@xx.edu -db pcassay -query "527855"[UID] | \
451 | > elink -target pcassay -name pcassay_pcassay_similar_publication_list | \
452 | > elink -target pccompound -name pcassay_pccompound -cmd neighbor | \
453 | > xtract -pattern LinkSet -element Id
454 | ...
455 | ...
456 | 601409	54580326
457 | 601408	54580326	53257623
458 | 601154	54580326
459 | 657046	16093559
460 | 657045	70695880	70693764	70687505	70687504	70687503	70683264	70683263	70681155	70681154	70681153	60150625
461 | 527862	52948352
462 | 527861	52948352
463 | ...
464 | ...
465 | ...
466 | ```
467 | 
468 | In the above table, the first column contains the AIDs and any subsequent columns contain the linked PubChem Compounds (CIDs).
469 | 
470 | _tested on 2021.01.27, EDirect 14.4, total count was 81 (returns all AIDs, not all have linked CIDs)._
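
Since not every AID has linked CIDs, it may help to summarize how many compounds each assay links to. A minimal sketch follows, assuming the tab-separated layout above; `count_links` is a hypothetical helper name.

```shell
# For each row with at least one linked ID, print the AID and the number of linked CIDs.
count_links() {
  awk -F'\t' -v OFS='\t' 'NF > 1 { print $1, NF - 1 }'
}

# Two sample rows copied from the table above
printf '601409\t54580326\n601408\t54580326\t53257623\n' | count_links
```

Rows containing only an AID are skipped, so the output lists just the assays that have at least one linked compound.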
471 | 
472 | 


--------------------------------------------------------------------------------
/02_EDirect_Data_Fields_Structure.md:
--------------------------------------------------------------------------------
  1 | # Available EDirect Databases, Data Fields, and Data Structures
  2 | 
  3 | **Notes**
  4 | 
  5 | > 1. `user@computer:~$` represents an example terminal prompt name. Actual command/argument input is after the `$`.
  6 | > 2. Replace `name@xx.edu` with your email address.
  7 | > 3. `\` followed by `>` on the next line represents continued terminal input. You will need to delete the `>` symbol in order to run the scripts as a copy/paste into terminal.
  8 | > 4. You should validate your own EDirect scripts and results as there may be unintentional mistakes in these recipes. A convenient method is to compare your EDirect results to the NCBI Web interface search results: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/).
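
As a convenience for note 3, the example prompt and the `>` continuation markers can be stripped from a copied command with `sed` rather than by hand. This is a sketch only: `strip_prompts` is a hypothetical helper, and it assumes the prompt is written exactly as `user@computer:~$ `.

```shell
# Remove the example prompt and the leading "> " continuation marker from each line.
strip_prompts() {
  sed -e 's/^user@computer:~\$ //' -e 's/^> //'
}

# A two-line recipe as it appears in these pages
printf '%s\n' \
  'user@computer:~$ esearch -email name@xx.edu -db pccompound -query 512323[UID] | \' \
  '> efetch -format docsum' | strip_prompts
```

Apply this only to the command lines, not to the printed results, since an output line that happens to start with `> ` would be altered as well.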
  9 | 
 10 | We can view available Entrez databases, data fields, and links (connected records) with the EDirect `einfo` function. To retrieve a list of all databases, use the `-dbs` argument:
 11 | 
 12 | ```console
 13 | 
 14 | user@computer:~$ einfo -email name@xx.edu -dbs
 15 | annotinfo
 16 | assembly
 17 | biocollections
 18 | bioproject
 19 | biosample
 20 | biosystems
 21 | blastdbinfo
 22 | books
 23 | cdd
 24 | clinvar
 25 | dbvar
 26 | gap
 27 | gapplus
 28 | gds
 29 | gene
 30 | genome
 31 | geoprofiles
 32 | grasp
 33 | gtr
 34 | homologene
 35 | ipg
 36 | medgen
 37 | mesh
 38 | ncbisearch
 39 | nlmcatalog
 40 | nuccore
 41 | nucleotide
 42 | omim
 43 | orgtrack
 44 | pcassay
 45 | pccompound
 46 | pcsubstance
 47 | pmc
 48 | popset
 49 | protein
 50 | proteinclusters
 51 | protfam
 52 | pubmed
 53 | seqannot
 54 | snp
 55 | sra
 56 | structure
 57 | taxonomy
 58 | 
 59 | ```
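
The database list is plain text with one name per line, so it can be narrowed with `grep`. As a small sketch, the PubChem databases all start with `pc`; the function name `pubchem_dbs` is hypothetical, and the sample names are taken from the list above.

```shell
# Keep only database names that start with "pc" (the PubChem databases).
pubchem_dbs() {
  grep '^pc'
}

# With a live connection this would be: einfo -email name@xx.edu -dbs | pubchem_dbs
printf '%s\n' omim orgtrack pcassay pccompound pcsubstance pmc popset | pubchem_dbs
```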
 60 | 
 61 | ## PubChem Compound EDirect Fields, Links, and Data
 62 | 
 63 | This EDirectChemInfo repository focuses on searching the PubChem Compound, PubMed, and PubChem BioAssay databases. So let's take a closer look at these three databases, starting with the PubChem Compound (`-db pccompound`) database. The `einfo` arguments `-fields` and `-links` provide information about the available data fields and linked information, respectively:
 64 | 
 65 | ```console
 66 | 
 67 | user@computer:~$ einfo -email name@xx.edu -db pccompound -fields
 68 | AC	ActiveAidCount
 69 | ACC	AtomChiralCount
 70 | ACDC	AtomChiralDefCount
 71 | ACUC	AtomChiralUndefCount
 72 | ALL	All Fields
 73 | BCC	BondChiralCount
 74 | BCDC	BondChiralDefCount
 75 | BCUC	BondChiralUndefCount
 76 | CDAT	CreateDate
 77 | CPLX	Complexity
 78 | CSYN	CompleteSynonym
 79 | CUC	CovalentUnitCount
 80 | DCNT	DepositorCount
 81 | DCSY	DepositorCompleteSynonym
 82 | DSYN	DepositorSynonym
 83 | ELMT	Element
 84 | EMAS	ExactMass
 85 | FILT	Filter
 86 | HAC	HeavyAtomCount
 87 | HBAC	HydrogenBondAcceptorCount
 88 | HBDC	HydrogenBondDonorCount
 89 | IAC	IsotopeAtomCount
 90 | IKEY	InChIKey
 91 | INCH	InChI
 92 | MMAS	MonoisotopicMass
 93 | MSHT	MeSHTerm
 94 | MW	MolecularWeight
 95 | PAID	PharmActionID
 96 | PHMA	PharmAction
 97 | RBC	RotatableBondCount
 98 | SID	SubstanceID
 99 | SRCC	SourceCategory
100 | SRC	SourceName
101 | STID	StructureID
102 | SYNO	Synonym
103 | TAC	TotalAidCount
104 | TFC	TotalFormalCharge
105 | TPSA	TPSA
106 | UID	CompoundID
107 | UPAC	IUPACName
108 | XLGP	XLogP
109 | 
110 | user@computer:~$ einfo -email name@xx.edu -db pccompound -links
111 | pccompound_biosystems	BioSystems
112 | pccompound_gene	Gene
113 | pccompound_mesh	MeSH Keyword
114 | pccompound_nuccore	Nucleotide Sequences
115 | pccompound_omim	OMIM
116 | pccompound_pcassay	BioAssays
117 | pccompound_pcassay_active	BioAssays, Active
118 | pccompound_pcassay_activityconcmicromolar	BioAssays, activity concentration at/below 1 uM
119 | pccompound_pcassay_activityconcnanomolar	BioAssays, activity concentration at/below 1 nM
120 | pccompound_pcassay_inactive	BioAssays, Inactive
121 | pccompound_pcassay_probe	BioAssays, Probe
122 | pccompound_pccompound	Similar Compounds
123 | pccompound_pccompound_3d	Similar Conformers
124 | pccompound_pccompound_mixture	Mixture/Component Compounds
125 | pccompound_pccompound_parent	Parent Compound
126 | pccompound_pccompound_parent_connectivity_pulldown	Same Parent, Connectivity
127 | pccompound_pccompound_parent_isotopes_pulldown	Same Parent, Isotopes
128 | pccompound_pccompound_parent_pulldown	Same Parent
129 | pccompound_pccompound_parent_stereo_pulldown	Same Parent, Stereochemistry
130 | pccompound_pccompound_parent_tautomer_pulldown	Same Parent, Any Tautomer
131 | pccompound_pccompound_sameanytautomer_pulldown	Same, Any Tautomer
132 | pccompound_pccompound_sameconnectivity_pulldown	Same, Connectivity
133 | pccompound_pccompound_sameisotopic_pulldown	Same, Isotopes
134 | pccompound_pccompound_samestereochem_pulldown	Same, Stereochemistry
135 | pccompound_pcsubstance	PubChem Mixture Substances
136 | pccompound_pcsubstance_same	PubChem Same Substances
137 | pccompound_pmc	PMC Articles
138 | pccompound_protein	Protein Sequences
139 | pccompound_pubmed	PubMed Citations
140 | pccompound_pubmed_mesh	PubMed (MeSH Keyword)
141 | pccompound_pubmed_publisher	PubMed (Publisher)
142 | pccompound_structure	Protein Structures
143 | pccompound_taxonomy	Taxonomy
144 | 
145 | ```
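
Link names follow the pattern `source_target[_qualifier]`, so the distinct target databases can be read off the `-links` output. The sketch below assumes that pattern holds and that database names contain no underscore; `link_targets` is a hypothetical helper, and the sample rows are taken from the list above. For these four rows it should list `pcassay` and `pubmed`.

```shell
# Print the distinct target databases (second "_"-separated token of the link name).
link_targets() {
  awk -F'\t' '{ split($1, part, "_"); print part[2] }' | sort -u
}

printf '%s\t%s\n' \
  pccompound_pcassay 'BioAssays' \
  pccompound_pcassay_active 'BioAssays, Active' \
  pccompound_pubmed 'PubMed Citations' \
  pccompound_pubmed_mesh 'PubMed (MeSH Keyword)' | link_targets
```
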
146 | Now that we have an understanding of what kinds of data are available in the PubChem Compound database, let's take a look at a PubChem Compound record using the `esearch` and `efetch` functions:
147 | 
148 | ```console
149 | 
150 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 512323[UID]
151 | <ENTREZ_DIRECT>
152 |   <Db>pccompound</Db>
153 |   <WebEnv>MCID...</WebEnv>
154 |   <QueryKey>1</QueryKey>
155 |   <Count>1</Count>
156 |   <Step>1</Step>
157 |   <Email>name@xx.edu</Email>
158 | </ENTREZ_DIRECT>
159 | 
160 | ```
161 | 
162 | We searched PubChem for the Compound Identifier 512323 using `esearch`, and the NCBI Entrez server returned a summary of the search results. The WebEnv and QueryKey specify the location of the search results on the NCBI server. In order to retrieve the data, we can pipe (`|`) the `esearch` results directly into the `efetch` function:
163 | 
164 | ```console
165 | 
166 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 512323[UID] | \
167 | > efetch -format docsum
168 | <?xml version="1.0" encoding="UTF-8" ?>
169 | <!DOCTYPE DocumentSummarySet PUBLIC "-//NLM//DTD esummary pccompound 20170720//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20170720/esummary_pccompound.dtd">
170 | <DocumentSummarySet status="OK">
171 |   <DbBuild>Build210125-0720m.1</DbBuild>
172 |   <DocumentSummary>
173 |     <Id>512323</Id>
174 |     <CID>512323</CID>
175 |     <SourceCategoryList>
176 |       <string>Chemical Vendors</string>
177 |       <string>Governmental Organizations</string>
178 |       <string>Subscription Services</string>
179 |       <string>Curation Efforts</string>
180 |       <string>Journal Publishers</string>
181 |       <string>Research and Development</string>
182 |       <string>Legacy Depositors</string>
183 |     </SourceCategoryList>
184 |     <CreateDate>2005/08/01 00:00</CreateDate>
185 |     <SynonymList>
186 |       <string>CHEMBL1791149</string>
187 |       <string>89647-10-9</string>
188 |       <string>Uridine, 2&apos;-deoxy-5-(2-thienyl)-</string>
189 |       <string>5-(2&apos;-Thienyl)-2&apos;-beta-deoxyuridine</string>
190 |       <string>SCHEMBL1635430</string>
191 |       <string>5-thien-2-yl-2&apos;-deoxyuridine</string>
192 |       <string>CTK2J2646</string>
193 |       <string>5-(2-thienyl)-2&apos;-deoxyuridine</string>
194 |       <string>5-(2&apos;-Thienyl)-2&apos;-deoxyuridine-</string>
195 |       <string>BDBM50407986</string>
196 |       <string>1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)oxolan-2-yl]-5-thiophen-2-ylpyrimidine-2,4-dione</string>
197 |       <string>1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)tetrahydrofuran-2-yl]-5-(2-thienyl)pyrimidine-2,4-dione</string>
198 |     </SynonymList>
199 |     <IUPACName>1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)oxolan-2-yl]-5-thiophen-2-ylpyrimidine-2,4-dione</IUPACName>
200 |     <CanonicalSmiles>C1C(C(OC1N2C=C(C(=O)NC2=O)C3=CC=CS3)CO)O</CanonicalSmiles>
201 |     <IsomericSmiles>C1[C@@H]([C@H](O[C@H]1N2C=C(C(=O)NC2=O)C3=CC=CS3)CO)O</IsomericSmiles>
202 |     <RotatableBondCount>3</RotatableBondCount>
203 |     <MolecularFormula>C13H14N2O5S</MolecularFormula>
204 |     <MolecularWeight>310.330</MolecularWeight>
205 |     <MolecularWeightSort>310330</MolecularWeightSort>
206 |     <TotalFormalCharge>0</TotalFormalCharge>
207 |     <XLogP>-0.2</XLogP>
208 |     <HydrogenBondDonorCount>3</HydrogenBondDonorCount>
209 |     <HydrogenBondAcceptorCount>6</HydrogenBondAcceptorCount>
210 |     <Complexity>483.000</Complexity>
211 |     <ComplexitySort>483000</ComplexitySort>
212 |     <HeavyAtomCount>21</HeavyAtomCount>
213 |     <AtomChiralCount>3</AtomChiralCount>
214 |     <AtomChiralDefCount>3</AtomChiralDefCount>
215 |     <AtomChiralUndefCount>0</AtomChiralUndefCount>
216 |     <BondChiralCount>0</BondChiralCount>
217 |     <BondChiralDefCount>0</BondChiralDefCount>
218 |     <BondChiralUndefCount>0</BondChiralUndefCount>
219 |     <IsotopeAtomCount>0</IsotopeAtomCount>
220 |     <CovalentUnitCount>1</CovalentUnitCount>
221 |     <TPSA>127</TPSA>
222 |     <ActiveAidCount>1</ActiveAidCount>
223 |     <TotalAidCount>37</TotalAidCount>
224 |     <InChIKey>PCDQBRGMSMVLDZ-IQJOONFLSA-N</InChIKey>
225 |     <ProbeAidCount>0</ProbeAidCount>
226 |     <InChI>InChI=1S/C13H14N2O5S/c16-6-9-8(17)4-11(20-9)15-5-7(10-2-1-3-21-10)12(18)14-13(15)19/h1-3,5,8-9,11,16-17H,4,6H2,(H,14,18,19)/t8-,9+,11+/m0/s1</InChI>
227 |   </DocumentSummary>
228 | </DocumentSummarySet>
229 | 
230 | ```
231 | 
232 | `efetch` returned the CID record data in document summary (docsum) XML format (for other formats see `efetch -help`). XML is useful, but we probably want to parse the data into a table for easier viewing and analysis. EDirect contains a function called `xtract` that can convert the Entrez XML data into tables. See `xtract -help` for more information. In brief, you will need to select a main XML heading tag to define the extract pattern and then specify the data you want to extract with the sub-heading tag names (elements). For example, in the above CID record data, we can set the pattern to DocumentSummary (the first main XML tag in this case), and then set the elements to a few of the sub-heading tags we are interested in, such as IsomericSmiles, CID, InChIKey, MolecularFormula, and MolecularWeight:
233 | 
234 | ```console
235 | 
236 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 512323[UID] | \
237 | > efetch -format docsum | \
238 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey MolecularFormula MolecularWeight
239 | C1[C@@H]([C@H](O[C@H]1N2C=C(C(=O)NC2=O)C3=CC=CS3)CO)O	512323	PCDQBRGMSMVLDZ-IQJOONFLSA-N	C13H14N2O5S	310.330
240 | 
241 | ```
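
The `xtract` table has no header row, which can make a saved file harder to interpret later. One simple option, sketched here with a hypothetical `with_header` helper, is to print the column names before passing the rows through; the names must be kept in the same order as the `-element` arguments.

```shell
# Print a header line matching the -element order, then pass the data rows through unchanged.
with_header() {
  printf 'IsomericSmiles\tCID\tInChIKey\tMolecularFormula\tMolecularWeight\n'
  cat
}

# Sample row copied from the output above
printf '%s\t%s\t%s\t%s\t%s\n' \
  'C1[C@@H]([C@H](O[C@H]1N2C=C(C(=O)NC2=O)C3=CC=CS3)CO)O' 512323 \
  PCDQBRGMSMVLDZ-IQJOONFLSA-N C13H14N2O5S 310.330 | with_header
```

Appending `| with_header > compounds.tsv` to the pipeline above would then save a labeled table (the file name is only an example).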
242 | 
243 | ## PubMed EDirect Fields, Links, and Data
244 | 
245 | As with PubChem Compound, let's preview the available PubMed database indexed fields, links, and data structure:
246 | 
247 | ```console
248 | 
249 | user@computer:~$ einfo -email name@xx.edu -db pubmed -fields
250 | AFFL	Affiliation
251 | ALL	All Fields
252 | AUCL	Author Cluster ID
253 | AUID	Author - Identifier
254 | AUTH	Author
255 | BOOK	Book
256 | CDAT	Date - Completion
257 | CNTY	Place of Publication
258 | COIS	Conflict of Interest Statements
259 | COLN	Author - Corporate
260 | CRDT	Date - Create
261 | DSO	DSO
262 | ECNO	EC/RN Number
263 | EDAT	Date - Entrez
264 | ED	Editor
265 | EID	Extended PMID
266 | EPDT	Electronic Publication Date
267 | FAUT	Author - First
268 | FILT	Filter
269 | FINV	Investigator - Full
270 | FULL	Author - Full
271 | GRNT	Grant Number
272 | INVR	Investigator
273 | ISBN	ISBN
274 | ISS	Issue
275 | JOUR	Journal
276 | LANG	Language
277 | LAUT	Author - Last
278 | LID	Location ID
279 | MAJR	MeSH Major Topic
280 | MDAT	Date - Modification
281 | MESH	MeSH Terms
282 | MHDA	Date - MeSH
283 | OTRM	Other Term
284 | PAGE	Pagination
285 | PAPX	Pharmacological Action
286 | PDAT	Date - Publication
287 | PID	Publisher ID
288 | PPDT	Print Publication Date
289 | PS	Subject - Personal Name
290 | PTYP	Publication Type
291 | PUBN	Publisher
292 | SI	Secondary Source ID
293 | SUBH	MeSH Subheading
294 | SUBS	Supplementary Concept
295 | TIAB	Title/Abstract
296 | TITL	Title
297 | TT	Transliterated Title
298 | UID	UID
299 | VOL	Volume
300 | WORD	Text Word
301 | 
302 | user@computer:~$ einfo -email name@xx.edu -db pubmed -links
303 | pubmed_assembly	Assembly
304 | pubmed_bioproject	Project Links
305 | pubmed_biosample	BioSample Links
306 | pubmed_biosystems	BioSystem Links
307 | pubmed_books_refs	Cited in Books
308 | pubmed_cdd	Conserved Domain Links
309 | pubmed_clinvar_calculated	ClinVar (calculated)
310 | pubmed_clinvar	ClinVar
311 | pubmed_dbvar	dbVar
312 | pubmed_gap	dbGaP Links
313 | pubmed_gds	GEO DataSet Links
314 | pubmed_gene_bookrecords	Gene (from Bookshelf)
315 | pubmed_gene_citedinomim	Gene (OMIM) Links
316 | pubmed_gene	Gene Links
317 | pubmed_gene_pmc_nucleotide	Gene (nucleotide/PMC)
318 | pubmed_gene_rif	Gene (GeneRIF) Links
319 | pubmed_genome	Genome Links
320 | pubmed_geoprofiles	GEO Profile Links
321 | pubmed_homologene	HomoloGene Links
322 | pubmed_medgen_bookshelf_cited	MedGen (Bookshelf cited)
323 | pubmed_medgen_genereviews	MedGen (GeneReviews)
324 | pubmed_medgen	MedGen
325 | pubmed_medgen_omim	MedGen (OMIM)
326 | pubmed_nuccore	Nucleotide Links
327 | pubmed_nuccore_refseq	Nucleotide (RefSeq) Links
328 | pubmed_nuccore_weighted	Nucleotide (Weighted) Links
329 | pubmed_omim_bookrecords	OMIM (from Bookshelf)
330 | pubmed_omim_calculated	OMIM (calculated) Links
331 | pubmed_omim_cited	OMIM (cited) Links
332 | pubmed_pcassay	PubChem BioAssay
333 | pubmed_pccompound_mesh	PubChem Compound (MeSH Keyword)
334 | pubmed_pccompound	PubChem Compound
335 | pubmed_pccompound_publisher	PubChem Compound (Publisher)
336 | pubmed_pcsubstance_bookrecords	PubChem Substance (from Bookshelf)
337 | pubmed_pcsubstance	PubChem Substance Links
338 | pubmed_pcsubstance_publisher	PubChem Substance (Publisher)
339 | pubmed_pmc_bookrecords	References in PMC for this Bookshelf citation
340 | pubmed_pmc_embargo
341 | pubmed_pmc_local
342 | pubmed_pmc	PMC Links
343 | pubmed_pmc_refs	Cited in PMC
344 | pubmed_popset	PopSet Links
345 | pubmed_probe	Probe Links
346 | pubmed_proteinclusters	Protein Cluster Links
347 | pubmed_protein	Protein Links
348 | pubmed_protein_refseq	Protein (RefSeq) Links
349 | pubmed_protein_weighted	Protein (Weighted) Links
350 | pubmed_protfam	Protein Family Models
351 | pubmed_pubmed_alsoviewed	Articles frequently viewed together
352 | pubmed_pubmed_bookrecords	References for this Bookshelf citation
353 | pubmed_pubmed_refs	References for PMC Articles
354 | pubmed_pubmed	Similar articles
355 | pubmed_snp_cited	SNP (Cited)
356 | pubmed_snp	SNP Links
357 | pubmed_sra	SRA Links
358 | pubmed_structure	Structure Links
359 | pubmed_taxonomy_entrez	Taxonomy via GenBank
360 | 
361 | ```
362 | 
363 | And here is an example PubMed article record in abstract form:
364 | 
365 | ```console
366 | 
367 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "17630804"[PMID] | \
368 | > efetch -format abstract
369 | 
370 | 1. J Org Chem. 2007 Aug 17;72(17):6621-3. Epub 2007 Jul 14.
371 | 
372 | Total synthesis and absolute configuration determination of (+)-bruguierol C.
373 | 
374 | Solorio DM(1), Jennings MP.
375 | 
376 | Author information: 
377 | (1)Department of Chemistry, 500 Campus Drive, The University of Alabama,
378 | Tuscaloosa, Alabama 35487-0336, USA.
379 | 
380 | The first total synthesis and absolute configuration of bruguierol C are
381 | reported. The key step involved the diastereoselective capture of an in situ
382 | generated oxocarbenium ion via an intramolecular Friedel-Crafts alkylation.
383 | 
384 | DOI: 10.1021/jo071035l 
385 | PMID: 17630804  [Indexed for MEDLINE]
386 | 
387 | ```
388 | 
389 | and the same record in XML format:
390 | 
391 | ```console
392 | 
393 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "17630804"[PMID] | \
394 | > efetch -format xml
395 | <?xml version="1.0" encoding="UTF-8" ?>
396 | <!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
397 | <PubmedArticleSet>
398 |   <PubmedArticle>
399 |     <MedlineCitation Status="MEDLINE" Owner="NLM">
400 |       <PMID Version="1">17630804</PMID>
401 |       <DateCompleted>
402 |         <Year>2007</Year>
403 |         <Month>10</Month>
404 |         <Day>25</Day>
405 |       </DateCompleted>
406 |       <DateRevised>
407 |         <Year>2007</Year>
408 |         <Month>08</Month>
409 |         <Day>10</Day>
410 |       </DateRevised>
411 |       <Article PubModel="Print-Electronic">
412 |         <Journal>
413 |           <ISSN IssnType="Print">0022-3263</ISSN>
414 |           <JournalIssue CitedMedium="Print">
415 |             <Volume>72</Volume>
416 |             <Issue>17</Issue>
417 |             <PubDate>
418 |               <Year>2007</Year>
419 |               <Month>Aug</Month>
420 |               <Day>17</Day>
421 |             </PubDate>
422 |           </JournalIssue>
423 |           <Title>The Journal of organic chemistry</Title>
424 |           <ISOAbbreviation>J Org Chem</ISOAbbreviation>
425 |         </Journal>
426 |         <ArticleTitle>Total synthesis and absolute configuration determination of (+)-bruguierol C.</ArticleTitle>
427 |         <Pagination>
428 |           <MedlinePgn>6621-3</MedlinePgn>
429 |         </Pagination>
430 |         <Abstract>
431 |           <AbstractText>The first total synthesis and absolute configuration of bruguierol C are reported. The key step involved the diastereoselective capture of an in situ generated oxocarbenium ion via an intramolecular Friedel-Crafts alkylation.</AbstractText>
432 |         </Abstract>
433 |         <AuthorList CompleteYN="Y">
434 |           <Author ValidYN="Y">
435 |             <LastName>Solorio</LastName>
436 |             <ForeName>Dionicio Martinez</ForeName>
437 |             <Initials>DM</Initials>
438 |             <AffiliationInfo>
439 |               <Affiliation>Department of Chemistry, 500 Campus Drive, The University of Alabama, Tuscaloosa, Alabama 35487-0336, USA.</Affiliation>
440 |             </AffiliationInfo>
441 |           </Author>
442 |           <Author ValidYN="Y">
443 |             <LastName>Jennings</LastName>
444 |             <ForeName>Michael P</ForeName>
445 |             <Initials>MP</Initials>
446 |           </Author>
447 |         </AuthorList>
448 |         <Language>eng</Language>
449 |         <PublicationTypeList>
450 |           <PublicationType UI="D016428">Journal Article</PublicationType>
451 |           <PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
452 |           <PublicationType UI="D013486">Research Support, U.S. Gov't, Non-P.H.S.</PublicationType>
453 |         </PublicationTypeList>
454 |         <ArticleDate DateType="Electronic">
455 |           <Year>2007</Year>
456 |           <Month>07</Month>
457 |           <Day>14</Day>
458 |         </ArticleDate>
459 |       </Article>
460 |       <MedlineJournalInfo>
461 |         <Country>United States</Country>
462 |         <MedlineTA>J Org Chem</MedlineTA>
463 |         <NlmUniqueID>2985193R</NlmUniqueID>
464 |         <ISSNLinking>0022-3263</ISSNLinking>
465 |       </MedlineJournalInfo>
466 |       <ChemicalList>
467 |         <Chemical>
468 |           <RegistryNumber>0</RegistryNumber>
469 |           <NameOfSubstance UI="D006575">Heterocyclic Compounds, 3-Ring</NameOfSubstance>
470 |         </Chemical>
471 |         <Chemical>
472 |           <RegistryNumber>0</RegistryNumber>
473 |           <NameOfSubstance UI="C523709">bruguierol C</NameOfSubstance>
474 |         </Chemical>
475 |       </ChemicalList>
476 |       <CitationSubset>IM</CitationSubset>
477 |       <MeshHeadingList>
478 |         <MeshHeading>
479 |           <DescriptorName UI="D006575" MajorTopicYN="N">Heterocyclic Compounds, 3-Ring</DescriptorName>
480 |           <QualifierName UI="Q000138" MajorTopicYN="N">chemical synthesis</QualifierName>
481 |           <QualifierName UI="Q000737" MajorTopicYN="Y">chemistry</QualifierName>
482 |         </MeshHeading>
483 |         <MeshHeading>
484 |           <DescriptorName UI="D009682" MajorTopicYN="N">Magnetic Resonance Spectroscopy</DescriptorName>
485 |         </MeshHeading>
486 |         <MeshHeading>
487 |           <DescriptorName UI="D015394" MajorTopicYN="Y">Molecular Structure</DescriptorName>
488 |         </MeshHeading>
489 |         <MeshHeading>
490 |           <DescriptorName UI="D021241" MajorTopicYN="N">Spectrometry, Mass, Electrospray Ionization</DescriptorName>
491 |         </MeshHeading>
492 |         <MeshHeading>
493 |           <DescriptorName UI="D013055" MajorTopicYN="N">Spectrophotometry, Infrared</DescriptorName>
494 |         </MeshHeading>
495 |         <MeshHeading>
496 |           <DescriptorName UI="D013237" MajorTopicYN="N">Stereoisomerism</DescriptorName>
497 |         </MeshHeading>
498 |       </MeshHeadingList>
499 |     </MedlineCitation>
500 |     <PubmedData>
501 |       <History>
502 |         <PubMedPubDate PubStatus="pubmed">
503 |           <Year>2007</Year>
504 |           <Month>7</Month>
505 |           <Day>17</Day>
506 |           <Hour>9</Hour>
507 |           <Minute>0</Minute>
508 |         </PubMedPubDate>
509 |         <PubMedPubDate PubStatus="medline">
510 |           <Year>2007</Year>
511 |           <Month>10</Month>
512 |           <Day>27</Day>
513 |           <Hour>9</Hour>
514 |           <Minute>0</Minute>
515 |         </PubMedPubDate>
516 |         <PubMedPubDate PubStatus="entrez">
517 |           <Year>2007</Year>
518 |           <Month>7</Month>
519 |           <Day>17</Day>
520 |           <Hour>9</Hour>
521 |           <Minute>0</Minute>
522 |         </PubMedPubDate>
523 |       </History>
524 |       <PublicationStatus>ppublish</PublicationStatus>
525 |       <ArticleIdList>
526 |         <ArticleId IdType="pubmed">17630804</ArticleId>
527 |         <ArticleId IdType="doi">10.1021/jo071035l</ArticleId>
528 |       </ArticleIdList>
529 |     </PubmedData>
530 |   </PubmedArticle>
531 | </PubmedArticleSet>
532 | 
533 | ```
534 | The XML PubMed record returned above is hard to read because it has many fields. We can use the `xtract -outline` argument to present a structured view of only the XML data tags:
535 | 
536 | ```console
537 | 
538 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "17630804"[PMID] | \
539 | > efetch -format xml | \
540 | > xtract -outline
541 | PubmedArticle
542 |   MedlineCitation
543 |     PMID
544 |     DateCompleted
545 |       Year
546 |       Month
547 |       Day
548 |     DateRevised
549 |       Year
550 |       Month
551 |       Day
552 |     Article
553 |       Journal
554 |         ISSN
555 |         JournalIssue
556 |           Volume
557 |           Issue
558 |           PubDate
559 |             Year
560 |             Month
561 |             Day
562 |         Title
563 |         ISOAbbreviation
564 |       ArticleTitle
565 |       Pagination
566 |         MedlinePgn
567 |       Abstract
568 |         AbstractText
569 |       AuthorList
570 |         Author
571 |           LastName
572 |           ForeName
573 |           Initials
574 |           AffiliationInfo
575 |             Affiliation
576 |         Author
577 |           LastName
578 |           ForeName
579 |           Initials
580 |       Language
581 |       PublicationTypeList
582 |         PublicationType
583 |         PublicationType
584 |         PublicationType
585 |       ArticleDate
586 |         Year
587 |         Month
588 |         Day
589 |     MedlineJournalInfo
590 |       Country
591 |       MedlineTA
592 |       NlmUniqueID
593 |       ISSNLinking
594 |     ChemicalList
595 |       Chemical
596 |         RegistryNumber
597 |         NameOfSubstance
598 |       Chemical
599 |         RegistryNumber
600 |         NameOfSubstance
601 |     CitationSubset
602 |     MeshHeadingList
603 |       MeshHeading
604 |         DescriptorName
605 |         QualifierName
606 |         QualifierName
607 |       MeshHeading
608 |         DescriptorName
609 |       MeshHeading
610 |         DescriptorName
611 |       MeshHeading
612 |         DescriptorName
613 |       MeshHeading
614 |         DescriptorName
615 |       MeshHeading
616 |         DescriptorName
617 |   PubmedData
618 |     History
619 |       PubMedPubDate
620 |         Year
621 |         Month
622 |         Day
623 |         Hour
624 |         Minute
625 |       PubMedPubDate
626 |         Year
627 |         Month
628 |         Day
629 |         Hour
630 |         Minute
631 |       PubMedPubDate
632 |         Year
633 |         Month
634 |         Day
635 |         Hour
636 |         Minute
637 |     PublicationStatus
638 |     ArticleIdList
639 |       ArticleId
640 |       ArticleId
641 | 
642 | ```
643 | 
644 | The above structured output makes it easier to view the XML formatting and determine which data elements we are interested in extracting out with `xtract`, such as the PMID, Author/LastName, Author/Initials, ISOAbbreviation, ArticleTitle, PubDate, Volume, Issue, and MedlinePgn:
645 | 
646 | ```console
647 | 
648 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "17630804"[PMID] | \
649 | > efetch -format xml | \
650 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName Author/Initials \
651 | > ISOAbbreviation ArticleTitle PubDate/Year Volume Issue MedlinePgn
652 | 17630804	Solorio	DM	J. Org. Chem.	Total synthesis and absolute configuration determination of (+)-bruguierol C.	2007	72	17	6621-3
653 | 
654 | ```
655 | Note that in the above `xtract` command, the elements listed after `-first` are selected with `-first` rather than `-element`, so that only the first occurrence of each is extracted (e.g., only the first Author).
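
The extracted columns can also be rearranged into a readable one-line citation. The following is a sketch under the assumption that the nine columns appear in the order used above; `format_citation` is a hypothetical helper, and the citation layout is illustrative rather than any particular journal style.

```shell
# Columns: PMID, LastName, Initials, ISOAbbreviation, ArticleTitle, Year, Volume, Issue, MedlinePgn
format_citation() {
  awk -F'\t' '{ printf "%s %s. %s %s %s;%s(%s):%s. PMID: %s\n", $2, $3, $5, $4, $6, $7, $8, $9, $1 }'
}

# Sample row copied from the output above
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  17630804 Solorio DM 'J. Org. Chem.' \
  'Total synthesis and absolute configuration determination of (+)-bruguierol C.' \
  2007 72 17 6621-3 | format_citation
```

Because only the first author is extracted with `-first`, the citation lists a single author; records with missing fields (for example no Issue) would shift the columns and need separate handling.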
656 | 
657 | ## PubChem BioAssay EDirect Fields, Links, and Data
658 | 
659 | Lastly, let's take a look at the PubChem BioAssay database indexed fields, links, and data structure:
660 | 
661 | ```console
662 | 
663 | user@computer:~$ einfo -email name@xx.edu -db pcassay -fields
664 | AC	Active Sid Count
665 | ACMD	Activity Outcome Method
666 | ACMT	Assay Comment
667 | ADES	Assay Description
668 | ALL	All Fields
669 | ANAM	Assay Name
670 | APRJ	Assay Project
671 | APRL	Assay Protocol
672 | ASRD	Assay Source ID
673 | BSID	BioSystems ID
674 | CCMT	Categorized Comment
675 | CCT	Categorized Comment Title
676 | CELL	Cell Line
677 | CSNM	Current Source Name
678 | DDAT	Deposit Date
679 | DTMD	Detection Method
680 | FILT	Filter
681 | GBAC	GenBank Accession
682 | GRN	Grant Number
683 | GSYM	Gene Symbol
684 | HDAT	Hold Until Date
685 | JNAM	Journal Name
686 | MDAT	Modify Date
687 | NARD	Nucleic Acid Reagent ID
688 | NSAM	Number of Sids With Activity Concentration micromolar
689 | NSAN	Number of Sids With Activity Concentration nanomolar
690 | ORGN	Organism
691 | PCC	Probe Cid Count
692 | PIGI	Pig GI
693 | PSC	Probe Sid Count
694 | PTGI	Protein Target GI
695 | PTN	Protein Target Name
696 | RTGI	RNA Target GI
697 | SNME	Source Name
698 | SRCC	Source Category
699 | SYNT	Synonym Tested
700 | TCNT	Target Count
701 | TSC	Total Sid Count
702 | TXNM	Taxonomy Name
703 | UID	Assay ID
704 | UPAC	UniProt Accession
705 | 
706 | user@computer:~$ einfo -email name@xx.edu -db pcassay -links
707 | pcassay_books_probe	MLP Chemical Probe Report
708 | pcassay_cdd_protein_target	Conserved Domains (Full) via Protein Target
709 | pcassay_gene_rnai	RNAi Target, Tested
710 | pcassay_gene_rnai_active	RNAi Target, Active
711 | pcassay_gene_target	Gene Target
712 | pcassay_nuccore	Nucleotide
713 | pcassay_nuccore_rna_target	Nucleotide RNA Target
714 | pcassay_omim	OMIM
715 | pcassay_pcassay_activityneighbor_list	Related BioAssays, by Activity Overlap (List)
716 | pcassay_pcassay_assay_project	Related Assay Projects
717 | pcassay_pcassay_common_gene_list	Related BioAssays, by Common Active Gene (List)
718 | pcassay_pcassay_gene_interaction_list	Related BioAssays, by Gene Interaction (List)
719 | pcassay_pcassay_neighbor_list	Related BioAssays, by Depositor (List)
720 | pcassay_pcassay_same_assay_project_list	Related BioAssays, by Same Project (List)
721 | pcassay_pcassay_same_publication_list	Related BioAssays, by Same Publication (List)
722 | pcassay_pcassay_similar_publication_list	Related BioAssays, by Similar Publication (List)
723 | pcassay_pcassay_targetneighbor_list	Related BioAssays, by Target Similarity (List)
724 | pcassay_pccompound	Compounds
725 | pcassay_pccompound_active	Compounds, Active
726 | pcassay_pccompound_activityconcmicromolar	Compounds, activity concentration at/below 1 uM
727 | pcassay_pccompound_activityconcnanomolar	Compounds, activity concentration at/below 1 nM
728 | pcassay_pccompound_inactive	Compounds, Inactive
729 | pcassay_pccompound_probe	Compounds, Probe
730 | pcassay_pcsubstance	Substances
731 | pcassay_pcsubstance_active	Substances, Active
732 | pcassay_pcsubstance_activityconcmicromolar	Substances, activity concentration at/below 1 uM
733 | pcassay_pcsubstance_activityconcnanomolar	Substances, activity concentration at/below 1 nM
734 | pcassay_pcsubstance_inactive	Substances, Inactive
735 | pcassay_pcsubstance_probe	Substances, Probe
736 | pcassay_pmc	PMC Articles
737 | pcassay_probe	Nucleic acid reagent
738 | pcassay_protein_target	Protein Target
739 | pcassay_protein_target_pig	Protein Target, Identical Sequence
740 | pcassay_pubmed	PubMed Citations
741 | pcassay_sparcle_target	Target Functional Class
742 | pcassay_structure	Protein Structures
743 | pcassay_taxonomy	Taxonomy
744 | ```
745 | 
746 | And here is an example PubChem BioAssay record in docsum XML:
747 | 
748 | ```console
749 | 
750 | user@computer:~$ esearch -email name@xx.edu -db pcassay -query "1236573"[UID] | \
751 | > efetch -format docsum
752 | <?xml version="1.0" encoding="UTF-8" ?>
753 | <!DOCTYPE DocumentSummarySet PUBLIC "-//NLM//DTD esummary pcassay 20161116//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20161116/esummary_pcassay.dtd">
754 | 
755 | <DocumentSummarySet status="OK">
756 | <DbBuild>Build200617-0952.1</DbBuild>
757 | 
758 | <DocumentSummary><Id>1236573</Id>
759 | 	<AssayName>Cytotoxicity against human NCI/ADR cells</AssayName>
760 | 	<CellLine></CellLine>
761 | 	<AssayDescription>Title: Progress Toward the Development of Noscapine and Derivatives as Anticancer Agents.  Abstract: Many nitrogen-moiety containing alkaloids derived from plant origins are bioactive and play a significant role in human health and emerging medicine. Noscapine, a phthalideisoquinoline alkaloid derived from Papaver somniferum, has been used as a cough suppressant since the mid 1950s, illustrating a good safety profile. Noscapine has since been discovered to arrest cells at mitosis, albeit with moderately weak activity. Immunofluorescence staining of microtubules after 24 h of noscapine exposure at 20  inverted question markM elucidated chromosomal abnormalities and the inability of chromosomes to complete congression to the equatorial plane for proper mitotic separation ( Proc. Natl. Acad. Sci. U. S. A. 1998 , 95 , 1601 - 1606 ). A number of noscapine analogues possessing various modifications have been described within the literature and have shown significantly improved antiprolific profiles for a large variety of cancer cell lines. Several semisynthetic antimitotic alkaloids are emerging as possible candidates as novel anticancer therapies. This perspective discusses the advancing understanding of noscapine and related analogues in the fight against malignant disease. </AssayDescription>
762 | 	<AssaySourceID>1506980</AssaySourceID>
763 | 	<SourceNameList>
764 | 		<string>ChEMBL</string>
765 | 	</SourceNameList>
766 | 	<CurrentSourceName>ChEMBL</CurrentSourceName>
767 | 	<DetectionMethod></DetectionMethod>
768 | 	<ActiveSidCount>1</ActiveSidCount>
769 | 	<ActivityOutcomeMethod>Confirmatory</ActivityOutcomeMethod>
770 | 	<TotalSidCount>1</TotalSidCount>
771 | 	<OnHold>No</OnHold>
772 | 	<ModifyDate>2018/10/08 00:00</ModifyDate>
773 | 	<DepositDate>2016/12/22 00:00</DepositDate>
774 | 	<HoldUntilDate>1/01/01 00:00</HoldUntilDate>
775 | 	<AID>1236573</AID>
776 | 	<ProbeSidCount>0</ProbeSidCount>
777 | 	<TargetCount>0</TargetCount>
778 | 	<NumberofSidsWithActivityConcmicromolar>1</NumberofSidsWithActivityConcmicromolar>
779 | 	<NumberofSidsWithActivityConcnanomolar>1</NumberofSidsWithActivityConcnanomolar>
780 | 	<ProteinTargetList>
781 | 	</ProteinTargetList>
782 | </DocumentSummary>
783 | 
784 | </DocumentSummarySet>
785 | ```
786 | As with the PubChem Compound and PubMed data, we can extract specific elements with `xtract`, such as the AID, CurrentSourceName, AssayName, ActiveSidCount, and TargetCount:
787 | 
788 | ```console
789 | 
790 | user@computer:~$ esearch -email name@xx.edu -db pcassay -query "1236573"[UID] | \
791 | > efetch -format docsum | \
792 | > xtract -pattern DocumentSummary -element AID CurrentSourceName AssayName ActiveSidCount TargetCount
793 | 1236573	ChEMBL	Cytotoxicity against human NCI/ADR cells	1	0
794 | ```
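
When collecting these values for later analysis, it can help to write them to a tab-delimited file with a header row. The following is a minimal sketch, assuming `xtract` is on the PATH. The file names `sample_docsum.xml` and `assays.tsv` are hypothetical, and the sample is a trimmed copy of the docsum record shown above so that the example runs without a network request.

```shell
# Trimmed copy of the docsum record above (only the elements we extract)
cat > sample_docsum.xml <<'EOF'
<DocumentSummarySet status="OK">
<DocumentSummary><Id>1236573</Id>
  <AssayName>Cytotoxicity against human NCI/ADR cells</AssayName>
  <CurrentSourceName>ChEMBL</CurrentSourceName>
  <ActiveSidCount>1</ActiveSidCount>
  <AID>1236573</AID>
  <TargetCount>0</TargetCount>
</DocumentSummary>
</DocumentSummarySet>
EOF

# Write the header row first; xtract separates columns with tabs by default
printf 'AID\tCurrentSourceName\tAssayName\tActiveSidCount\tTargetCount\n' > assays.tsv

if command -v xtract >/dev/null 2>&1; then
  # Append one row per DocumentSummary to the file
  xtract -pattern DocumentSummary \
    -element AID CurrentSourceName AssayName ActiveSidCount TargetCount \
    < sample_docsum.xml >> assays.tsv
else
  echo "xtract not found; install EDirect to run this example" >&2
fi

cat assays.tsv
```

In a live pipeline, the `< sample_docsum.xml` redirect would be replaced by piping the output of `esearch ... | efetch -format docsum` into `xtract`.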
795 | 
796 | 


--------------------------------------------------------------------------------