├── images
│   ├── mview01.png
│   ├── mview02.png
│   ├── obabel01.png
│   └── obabel02.png
├── Workshops
│   └── EDirect Workshop_Scalfani_sp2021.pdf
├── ACS_sp2021_talk
│   └── Scalfani_VF_EDirect_ACS_sp2021.pdf
├── LICENSE
├── README.md
├── 01_EDirect_Intro.md
├── 06_EDirect_Combining_Tools.md
├── 04_EDirect_PubChem_Recipes.md
├── 05_EDirect_PubMed_Recipes.md
├── 03_EDirect_PubChem_BioAssay_PubMed_Recipes.md
└── 02_EDirect_Data_Fields_Structure.md
/images/mview01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/images/mview01.png
--------------------------------------------------------------------------------
/images/mview02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/images/mview02.png
--------------------------------------------------------------------------------
/images/obabel01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/images/obabel01.png
--------------------------------------------------------------------------------
/images/obabel02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/images/obabel02.png
--------------------------------------------------------------------------------
/Workshops/EDirect Workshop_Scalfani_sp2021.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/Workshops/EDirect Workshop_Scalfani_sp2021.pdf
--------------------------------------------------------------------------------
/ACS_sp2021_talk/Scalfani_VF_EDirect_ACS_sp2021.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/UA-Libraries-Research-Data-Services/EDirectChemInfo/master/ACS_sp2021_talk/Scalfani_VF_EDirect_ACS_sp2021.pdf
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 Vincent F. Scalfani
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # EDirectChemInfo
2 |
3 | **Notes**
4 |
5 | > Oct 21, 2024 - This repository has recently been transferred from The University of Alabama Libraries Web Services GitHub to The University of Alabama Libraries Research Data Services GitHub organization.
6 | > All GitHub related hyperlinks should automatically redirect to the new GitHub location, but if you notice anything that is not working correctly, please let us know.
7 |
8 | This repository contains Entrez Direct (EDirect, an NCBI tool) Unix scripts for programmatically obtaining data from various NCBI databases. Other EDirect resources and guides exist (referenced below). This EDirectChemInfo repository differs in that the focus is on teaching how to obtain chemical information, cheminformatics data, and chemical structure <--> bioassay <--> document relationship links. There are not many PubChem EDirect examples available, so hopefully this repository proves useful. I have also added some tips, step-wise directions, and code output examples to help you get started.
9 |
10 | Please note that this EDirectChemInfo repository is not affiliated with NCBI. You should contact [NCBI](https://www.ncbi.nlm.nih.gov/books/NBK179288/#_chapter6_For_More_Information_) for specific questions related to EDirect. This repository was created to accompany library instruction at The University of Alabama. With that in mind, please feel free to open a GitHub Issue or contact me directly with comments/questions if you think there is something I can help you with. In addition, if this repository has been a useful resource for you, please do let me know as this type of feedback can help prioritize my time.
11 |
12 | Vincent Scalfani\
13 | Science and Engineering Librarian\
14 | The University of Alabama\
15 | [UA Libraries Directory](https://www.lib.ua.edu/#/staffdir?liaison=1&search=scalfani)
16 |
17 | ## Contents
18 |
19 | * [What is EDirect?](https://github.com/vfscalfani/EDirectChemInfo/blob/master/01_EDirect_Intro.md)
20 | * [Installation Tips](https://github.com/vfscalfani/EDirectChemInfo/blob/master/01_EDirect_Intro.md#installation-tips)
21 | * [Usage Tips](https://github.com/vfscalfani/EDirectChemInfo/blob/master/01_EDirect_Intro.md#usage-tips)
22 | * [EDirect Function Help and Debug](https://github.com/vfscalfani/EDirectChemInfo/blob/master/01_EDirect_Intro.md#edirect-function-help)
23 | * [Available Databases, Data Fields, and Data Structures](https://github.com/vfscalfani/EDirectChemInfo/blob/master/02_EDirect_Data_Fields_Structure.md)
24 | * [PubChem <--> PubChem BioAssay <--> PubMed EDirect Recipes](https://github.com/vfscalfani/EDirectChemInfo/blob/master/03_EDirect_PubChem_BioAssay_PubMed_Recipes.md)
25 | * [PubChem EDirect Recipes](https://github.com/vfscalfani/EDirectChemInfo/blob/master/04_EDirect_PubChem_Recipes.md)
26 | * [PubMed EDirect Recipes](https://github.com/vfscalfani/EDirectChemInfo/blob/master/05_EDirect_PubMed_Recipes.md)
27 | * [Combining EDirect Results with Chemical Depiction and Plotting](https://github.com/vfscalfani/EDirectChemInfo/blob/master/06_EDirect_Combining_Tools.md)
28 |
29 | ## References
30 |
31 | These are the main references I used to learn about NCBI E-Utilities, the EDirect syntax, Unix commands/scripts, and the importance of linked chemical data. Many thanks to the authors for their work.
32 |
33 | 1. [NCBI Documentation for Entrez Direct: E-utilities on the UNIX Command Line](https://www.ncbi.nlm.nih.gov/books/NBK179288/)
34 | 2. [NIH NLM The Insider's Guide to Accessing NLM Data](https://dataguide.nlm.nih.gov/)
35 | 3. [NCBI EDirect Cookbook](https://github.com/NCBI-Hackathons/EDirectCookbook)
36 | 4. [Computational Genomics Manual: NCBI EDirect](https://github.com/linsalrob/ComputationalGenomicsManual/blob/master/Databases/NCBI_Edirect.md)
37 | 5. [Entrez Link Descriptions](https://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html)
38 | 6. [Software Carpentry: The Unix Shell](https://swcarpentry.github.io/shell-novice/)
39 | 7. [Opening up connectivity between documents, structures and bioactivity by Christopher Southan](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7136548/)
40 |
41 |
42 | ## License Notes
43 |
44 | Code in this repository is licensed under the [MIT License](https://github.com/vfscalfani/EDirectChemInfo/blob/master/LICENSE). Some of the chemical depiction demonstrations from EDirect output use proprietary software, such as ChemAxon Marvin, which is not included under this license. Users must have valid licenses for any required proprietary software to run these portions of the code.
45 |
46 | Code output (e.g., reference/molecular data snippets) retrieved from NCBI via their EDirect utility is shown for code demonstration purposes only and is credited to NCBI and NLM. Please see the [NCBI Website and Data Usage Policies and Disclaimers](https://www.ncbi.nlm.nih.gov/home/about/policies/) for more information regarding the data.
47 |
48 |
--------------------------------------------------------------------------------
/01_EDirect_Intro.md:
--------------------------------------------------------------------------------
1 | # What is EDirect?
2 |
3 | EDirect is a Unix command line tool from NCBI that allows programmatic retrieval of chemical/biological data and literature references from NCBI databases. EDirect reduces the barrier to accessing NCBI data programmatically; that is, with a basic knowledge of the Unix shell (e.g., bash), it is straightforward to obtain and format your own custom datasets, often with only a few lines of code. Moreover, you can input data retrieved from EDirect into other Unix tools for quick viewing and analysis ([Pipeline (Unix)](https://en.wikipedia.org/wiki/Pipeline_(Unix))).
4 |
5 | ## Installation Tips
6 |
7 | Follow the installation instructions from NCBI: [Entrez Direct: E-utilities on the UNIX Command Line](https://www.ncbi.nlm.nih.gov/books/NBK179288/). There are several different methods to install EDirect. I used option 3 (EDirect v14.4) with `wget` in Gnome Terminal on a Linux Ubuntu 18.04 workstation. If you are using Windows, NCBI mentions that you can use the Cygwin Unix emulator. Another option for Windows users is to set up a Linux virtual machine. There are many tutorials for setting up virtual machines. For example, here is one for installing [Ubuntu on VirtualBox](https://askubuntu.com/questions/142549/how-to-install-ubuntu-on-virtualbox). When installing EDirect in a virtual machine, you may need to customize the VirtualBox network settings in order to use the `curl` or `wget` EDirect installation methods. In my testing on an Ubuntu 20.04 virtual machine, the fourth installation option for EDirect (using the longer perl script) worked fine with the standard VirtualBox network settings.
8 |
9 | ## Usage Tips
10 |
11 | NCBI has specific data usage policies and disclaimers:
12 |
13 | * [NCBI Website and Data Usage Policies and Disclaimers](https://www.ncbi.nlm.nih.gov/home/about/policies/)
14 | * [Entrez Programming Utilities Help](https://www.ncbi.nlm.nih.gov/books/NBK25501/)
15 |
16 | If you do not follow NCBI's usage policies (e.g., no more than 3 requests per second), NCBI may block your IP address. So be cautious and follow good programming practices of testing and adding sleep delays, particularly if executing multiple sequential calls in a loop. Moreover, it is always a good idea to include your email address in the requests so that NCBI can contact you if necessary. You can add your email address within each query like this:
17 |
18 | ```console
19 |
20 | user@computer:~$ e-function -email name@xx.edu -arg input
21 |
22 | ```
23 | Replace `name@xx.edu` with your email address. The `e-function` is a placeholder for one of the actual EDirect functions like `einfo` or `esearch`, and `-arg input` is a placeholder for e-function argument(s) like `-db pccompound` or `-db pubmed -query "food allergies"`.
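
The advice above about adding sleep delays to sequential calls can be sketched as a small shell function. This is a minimal sketch and not part of EDirect: `run_with_delay` is a hypothetical helper name, and the delay you pass in is your own choice and should follow NCBI's current usage policy.

```shell
# run_with_delay DELAY COMMAND [ARGS...]
# Hypothetical helper (not part of EDirect): reads one item per line from
# standard input, runs COMMAND ARGS ITEM for each item, and sleeps DELAY
# seconds after every call so that sequential requests are spaced out.
run_with_delay() {
  delay="$1"
  shift
  while IFS= read -r item; do
    # skip empty lines
    [ -n "$item" ] || continue
    # redirect stdin so the command cannot consume the remaining items
    "$@" "$item" < /dev/null
    sleep "$delay"
  done
}

# Local demonstration, with echo standing in for an EDirect call:
printf '%s\n' 169731 11615487 | run_with_delay 0 echo "would query CID"
```

In real use, `echo` would be replaced by your own function that wraps an EDirect pipeline, and the delay would be a non-zero value such as `1`.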
24 |
25 | ## EDirect Function Help
26 |
27 | I generally refer to the official [Entrez Programming Utilities Help Document](https://www.ncbi.nlm.nih.gov/books/NBK25501/) or the [NIH NLM E-Utilities Documentation](https://dataguide.nlm.nih.gov/eutilities/utilities.html). However, for a quick reference or reminder of the proper syntax, the `-help` option is useful. Here is an example with the `einfo` function:
28 |
29 | ```console
30 |
31 | user@computer:~$ einfo -help
32 | einfo 14.4
33 |
34 | Database Selection
35 |
36 | -dbs Print all database names
37 | -db Database name (or "all")
38 |
39 | Data Summaries
40 |
41 | -fields Print field names
42 | -links Print link names
43 |
44 | Field Example
45 |
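46 |   <Field>
47 |     <Name>ALL</Name>
48 |     <FullName>All Fields</FullName>
49 |     <Description>All terms from all searchable fields</Description>
50 |     <TermCount>245340803</TermCount>
51 |     <IsDate>N</IsDate>
52 |     <IsNumerical>N</IsNumerical>
53 |     <SingleToken>N</SingleToken>
54 |     <Hierarchy>N</Hierarchy>
55 |     <IsHidden>N</IsHidden>
56 |     <IsTruncatable>Y</IsTruncatable>
57 |     <IsRangable>N</IsRangable>
58 |   </Field>
59 |
60 | Link Example
61 |
62 |   <Link>
63 |     <Name>pubmed_protein</Name>
64 |     <Menu/>
65 |     <Description>Published protein sequences</Description>
66 |     <DbTo>protein</DbTo>
67 |   </Link>
68 |   <Link>
69 |     <Name>pubmed_protein_refseq</Name>
70 |     <Menu/>
71 |     <Description>Link to Protein RefSeqs</Description>
72 |     <DbTo>protein</DbTo>
73 |   </Link>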
74 |
75 | ```
76 |
77 | ## EDirect Query Translation via Debug Flag
78 |
79 | When experimenting with searches in EDirect, it is often helpful to view the interpreted query. This can be accomplished using the `-debug` flag in EDirect 14.4 (thanks to NLM Support for the explanation and tip!):
80 |
81 | ```console
82 |
83 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "hydrogel-based drug delivery" -debug
84 | nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ esearch.fcgi -retmax 0 -usehistory y -db pubmed -term "hydrogel-based drug delivery" -tool edirect -edirect 14.4 -edirect_os Linux -email name@xx.edu
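85 | <ENTREZ_DIRECT>
86 |   <Db>pubmed</Db>
87 |   <WebEnv>MCID...</WebEnv>
88 |   <QueryKey>1</QueryKey>
89 |   <Count>436</Count>
90 |   <Step>1</Step>
91 |   <Email>name@xx.edu</Email>
92 |   <Debug>Y</Debug>
93 | </ENTREZ_DIRECT>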
94 |
95 | ```
96 | Next, copy and run the `nquire` command and pipe the results to `xtract`, extracting out the QueryTranslation element:
97 |
98 | ```console
99 |
100 | user@computer:~$ nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ esearch.fcgi -retmax 0 -usehistory y -db pubmed -term "hydrogel-based drug delivery" | xtract -pattern eSearchResult -element QueryTranslation
101 | hydrogel-based[All Fields] AND ("drug delivery systems"[MeSH Terms] OR ("drug"[All Fields] AND "delivery"[All Fields] AND "systems"[All Fields]) OR "drug delivery systems"[All Fields] OR ("drug"[All Fields] AND "delivery"[All Fields]) OR "drug delivery"[All Fields])
102 |
103 | ```
104 |
105 |
--------------------------------------------------------------------------------
/06_EDirect_Combining_Tools.md:
--------------------------------------------------------------------------------
1 | # Combining EDirect Results with Chemical Depiction and Plotting
2 |
3 | **Notes**
4 |
5 | > 1. `user@computer:~$` represents an example terminal prompt name. Actual command/argument input is after the `$`.
6 | > 2. Replace `name@xx.edu` with your email address.
7 | > 3. `\` followed by `>` on the next line represents continued terminal input. You will need to delete the `>` symbol in order to run the scripts as a copy/paste into terminal.
8 | > 4. You should validate your own EDirect scripts and results as there may be unintentional mistakes in these recipes. A convenient method is to compare your EDirect results to the NCBI Web interface search results: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/).
9 |
10 | ## EDirect --> Chemical Depiction and Plots
11 |
12 | It is possible to pipe EDirect results into chemical structure viewers as some cheminformatics toolkits can read chemical file formats (e.g., SMILES) directly from standard input.
13 |
14 | ### ChemAxon MarvinView Chemical Depiction
15 |
16 | For [ChemAxon Marvin](https://chemaxon.com/products/marvin), we can pipe EDirect compiled SMILES directly into Marvin View (`mview`). Note that the `-` is the `mview` option to read structures from standard input.
17 |
18 | ```console
19 |
20 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "132427739"[UID] | \
21 | > efetch -format docsum | \
22 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey IUPACName | \
23 | > mview -
24 |
25 | ```
26 |
27 | 
28 |
29 | If you have multiple molecules to display, you can use the `mview` standard input option `-` along with the `gridbag` option to display the molecules in a matrix:
30 |
31 |
32 | ```console
33 |
34 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "132427739"[UID] | \
35 | > elink -target pccompound -name pccompound_pccompound | \
36 | > efetch -format docsum | \
37 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey IUPACName | \
38 | > mview --gridbag -
39 |
40 | ```
41 |
42 | 
43 |
44 | _tested with ChemAxon MarvinView version 19.27.0._
45 |
46 | ### Open Babel Chemical Depiction
47 |
48 | One really cool feature of using [Open Babel](https://github.com/openbabel) is the ability to display molecules as ASCII figures directly in the terminal. Below, we pipe the results to Open Babel, which reads them from standard input in SMILES format (`-ismi`) and writes them in ASCII format (`-oascii`). The `-xh 10` is a resizing option.
49 |
50 | ```console
51 |
52 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "13586"[UID] | \
53 | > efetch -format docsum | \
54 | > xtract -pattern DocumentSummary -element IsomericSmiles | \
55 | > openbabel.obabel -ismi -oascii -xh 10
56 | __
57 | __ \__
58 | _/ \__ \_
59 | __/ \_ \_
60 | O ________/ \_
61 | /
62 | /
63 | \ |
64 | \ /
65 | \/
66 | 1 molecule converted
67 | ```
68 |
69 | This works for multiple molecules too!
70 |
71 | ```console
72 |
73 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "132427739"[UID] | \
74 | > elink -target pccompound -name pccompound_pccompound | \
75 | > efetch -format docsum | \
76 | > xtract -pattern DocumentSummary -element IsomericSmiles | \
77 | > openbabel.obabel -ismi -oascii -xh 10
78 | \
79 | ____
80 | |
81 | __ O_Si__
82 | O\ \__\ |
83 | |\N / \
84 | __O| \__/
85 | /__/ \ OH
86 | | \
87 | /| /
88 | /
89 | / |
90 | \_ | O ____
91 | \__O___ |____
92 | __Si_O N \\/ \
93 | \ /| /
94 | | \___ __N
95 | \__O ____
96 | O / ___\/ \\
97 | / /| /
98 | __ __
99 | \_/
100 | | /
101 | | |
102 | /____ O __Si__/
103 | |/ \ | | |
104 | / \ O\\ ____ _| \
105 | \ ___ | || / \_/ \__
106 | \____/ \\_N \ /
107 | \ | _/ | /
108 | ___
109 | / \/
110 | || |\
111 | \__
112 | \_O ___
113 | |_N/ \ /
114 | O|/ _\__\_/
115 | / /_/ O_Si__
116 | \_
117 | \O |
118 | /
119 | _Si/
120 | \ |_
121 | O____O | \
122 | __|Si_ | | ____
123 | / \_\ __ | |
124 | / | / \____N__|
125 | \O| | / \_
126 | \O/\___ \
127 | \ | ||
128 | \_ /
129 | ___ \
130 | / \ O_Si_|
131 | |__\ \ \ \ |
132 | /\_O \__\ _\
133 | __N / \/ \
134 | \ / / |
135 | | O \__/
136 | \ ___\
137 | ____/
138 |
139 | ...
140 | 24 molecules converted
141 | ```
142 |
143 | You can save depictions in a more classic PNG file using Open Babel with either a single molecule:
144 |
145 | ```console
146 |
147 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "13586"[UID] | \
148 | > efetch -format docsum | \
149 | > xtract -pattern DocumentSummary -element IsomericSmiles CID | \
150 | > openbabel.obabel -ismi -O 13586.png
151 | 1 molecule converted
152 | ```
153 |
154 | 
155 |
156 |
157 | or multiple molecules in a matrix:
158 |
159 | ```console
160 |
161 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "132427739"[UID] | \
162 | > elink -target pccompound -name pccompound_pccompound | \
163 | > efetch -format docsum | \
164 | > xtract -pattern DocumentSummary -element IsomericSmiles CID | \
165 | > openbabel.obabel -ismi -O 132427739_similar.png -xp 1400
166 | 24 molecules converted
167 | ```
168 |
169 | 
170 |
171 | _Tested with Open Babel v3.0.0 installed from Snap. I did receive a Font Configuration error when saving the PNG files, however, the conversion seemed to work fine._
172 |
173 | ### gnuplot Data plotting
174 |
175 | [gnuplot](http://www.gnuplot.info/) is a command-line graphing program that allows plotting data from standard input. In gnuplot, there is an option called "dumb terminal" that creates plots using ASCII characters directly in the terminal window, which is convenient for initial analysis of compiled EDirect data. For example, here is some data related to the number of *J Cheminform* articles indexed in PubMed by publication date:
176 |
177 | ```console
178 |
179 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Cheminform[JOUR]" | \
180 | > efetch -format docsum | \
181 | > xtract -pattern DocumentSummary -element PubDate | \
182 | > cut -d " " -f 1 | \
183 | > sort-uniq-count-rank | \
184 | > sort -k2
185 | 22 2009
186 | 12 2010
187 | 54 2011
188 | 39 2012
189 | 52 2013
190 | 71 2014
191 | 78 2015
192 | 71 2016
193 | 67 2017
194 | 68 2018
195 | 56 2019
196 | ```
197 | We can pipe this data directly to gnuplot. In the below script, `set term dumb` is the gnuplot option to create an ASCII plot, `-` sets the data input to standard input instead of a file, `using 2:1` sets the second column as the x-axis and the first column as the y-axis, `with boxes` draws each count as a box (a bar chart), and `notitle` removes the plot legend:
198 |
199 | ```console
200 |
201 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Cheminform[JOUR]" | \
202 | > efetch -format docsum | \
203 | > xtract -pattern DocumentSummary -element PubDate | \
204 | > cut -d " " -f 1 | \
205 | > sort-uniq-count-rank | \
206 | > sort -k2 | \
207 | > gnuplot -e "set term dumb; plot '-' using 2:1 with boxes notitle"
208 |
209 |
210 | 80 +---------------------------------------------------------------------+
211 | | + + + ******* + + |
212 | | * * |
213 | 70 |-+ ******* ******* ******* +-|
214 | | * * * ****** * |
215 | | * * * * * * |
216 | 60 |-+ * * * * * * +-|
217 | | ****** * * * * * ******* |
218 | | * * ******* * * * * * * |
219 | 50 |-+ * * * * * * * * * *+-|
220 | | * * * * * * * * * * |
221 | 40 |-+ * * * * * * * * * *+-|
222 | | * ******* * * * * * * * |
223 | | * * * * * * * * * * |
224 | 30 |-+ * * * * * * * * * *+-|
225 | | * * * * * * * * * * |
226 | | * * * * * * * * * * |
227 | 20 |-+******* * * * * * * * * * *+-|
228 | | * * * * * * * * * * * * |
229 | | * ******* * + * * + * * + * * + * * |
230 | 10 +---------------------------------------------------------------------+
231 | 2008 2010 2012 2014 2016 2018 2020
232 |
233 | ```
234 |
235 | _Tested with gnuplot-x11 5.2.8._
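
`sort-uniq-count-rank` is an alias that ships with EDirect. If you want to test the year tally offline, or on a machine without EDirect, the same kind of table can be approximated with standard Unix tools. The sketch below is only an approximation under that assumption: `tally_years` is a hypothetical name, the sample dates are made up for the demonstration, and details such as case handling may differ from the EDirect alias.

```shell
# tally_years: hypothetical stand-in for the "cut | sort-uniq-count-rank | sort"
# steps above. Reads PubDate-like lines (year first), keeps the year, counts
# each distinct year, and prints "count<TAB>year" ordered by year.
tally_years() {
  cut -d ' ' -f 1 |
    sort |
    uniq -c |
    awk '{ print $1 "\t" $2 }' |
    sort -k2,2n
}

# Made-up sample input, for demonstration only:
printf '%s\n' '2010 Mar 2' '2009 Dec' '2010' | tally_years
```

The output has the count in the first column and the year in the second, which is the layout that `using 2:1` in the gnuplot command expects.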
236 |
237 |
238 |
--------------------------------------------------------------------------------
/04_EDirect_PubChem_Recipes.md:
--------------------------------------------------------------------------------
1 | # PubChem EDirect Recipes
2 |
3 | **Notes**
4 |
5 | > 1. `user@computer:~$` represents an example terminal prompt name. Actual command/argument input is after the `$`.
6 | > 2. Replace `name@xx.edu` with your email address.
7 | > 3. `\` followed by `>` on the next line represents continued terminal input. You will need to delete the `>` symbol in order to run the scripts as a copy/paste into terminal.
8 | > 4. You should validate your own EDirect scripts and results as there may be unintentional mistakes in these recipes. A convenient method is to compare your EDirect results to the NCBI Web interface search results: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/).
9 |
10 | ## PubChem EDirect
11 |
12 | ### Search PubChem Compound via InChIKey and Retrieve Data
13 |
14 | In the below script, we first query the PubChem Compound database (`pccompound`) for "NJTXJDYZPQNTSM-WMZOPIPTSA-N" in the InChIKey (`[IKEY]`) field. Next, the record is retrieved as an XML document summary (docsum), and several properties are extracted with the `xtract` function, including the IsomericSmiles, CID, InChIKey, and IUPACName.
15 |
16 | ```console
17 |
18 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "NJTXJDYZPQNTSM-WMZOPIPTSA-N"[IKEY] | \
19 | > efetch -format docsum | \
20 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey IUPACName
21 | C[C@]12CCC(=O)C=C1CCC[C@@H]2OC(=O)C3=CC=CC=C3 11044292 NJTXJDYZPQNTSM-WMZOPIPTSA-N [(1S,8aS)-8a-methyl-6-oxo-1,2,3,4,7,8-hexahydronaphthalen-1-yl] benzoate
22 | ```
23 | _tested on 2021.01.27, EDirect 14.4, total count was 1._
24 |
25 | ### Search PubChem Compound with a list of CIDs and Retrieve Data
26 |
27 | If we have a small list of PubChem Compound Identifiers (CIDs) and need to retrieve specific data for each CID, we can write a for loop directly in the terminal. Note that in the below Bash script, I added a sleep of one second within the loop in an effort to not overload the NCBI servers.
28 |
29 | ```console
30 |
31 | user@computer:~$ for myCID in \
32 | > "146021325" \
33 | > "11068043" \
34 | > "11615487" \
35 | > "10056179" \
36 | > "169731"
37 | > do
38 | > esearch -email name@xx.edu -db pccompound -query "$myCID[UID]" |
39 | > efetch -format docsum |
40 | > xtract -pattern DocumentSummary -lbl "$myCID" -element IsomericSmiles InChIKey MolecularFormula MolecularWeight
41 | > sleep 1
42 | > done
43 | 146021325 CN1C(=CN=C1Cl)C(/C=C/C2=CC=CC=C2)(C3=CC=CC=C3)O AKSFJXCUMHAJKP-OUKQBFOZSA-N C19H17ClN2O 324.800
44 | 11068043 CC(C)[Si](C(C)C)(C(C)C)OC(CCCC1=CCC=CC1)CC=C JDKBJINLKCZQNX-UHFFFAOYSA-N C22H40OSi 348.600
45 | 11615487 CC1=CC(=C(C=C1)NC(=O)C(C)(C)C)OC WMXHBZHMNCGLQQ-UHFFFAOYSA-N C13H19NO2 221.290
46 | 10056179 CC(=O)N1CN2C3=CC=CC=C3C(=C2C4=CC=CC=C41)C(C(=O)NCC[Se]C5=CC=CC=C5)O MJWHOMSECISGAK-UHFFFAOYSA-N C27H25N3O3Se 518.500
47 | 169731 C1=CC=C2C(=C1)C=C(N2)CC#N RORMSTAFXZRNGK-UHFFFAOYSA-N C10H8N2 156.180
48 | ```
49 | _tested on 2021.01.27, EDirect 14.4, total count was 5 (as expected in the for loop)._
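
For longer lists, the CIDs will usually come from a file rather than being typed into the loop. Before sending anything to NCBI, it can help to clean the list locally. The following is a minimal sketch with hypothetical helper names (`clean_cids`, `cid_term`); it only prepares the query terms and makes no network calls.

```shell
# clean_cids: hypothetical helper that reads one candidate CID per line,
# trims surrounding whitespace, and keeps only purely numeric entries.
clean_cids() {
  sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | grep -E '^[0-9]+$'
}

# cid_term CID: prints the query term used in the loop above, e.g. 169731[UID]
cid_term() {
  printf '%s[UID]\n' "$1"
}

# Demonstration with a small inline list (one entry is deliberately invalid):
printf ' 169731 \nnot-a-cid\n11615487\n' | clean_cids | while IFS= read -r cid; do
  cid_term "$cid"
done
```

Each printed term can then be passed to `esearch -db pccompound -query` inside a loop with a sleep, as in the recipe above.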
50 |
51 | ### Retrieve Pre-Computed Linked Similar Compounds
52 |
53 | In the below script, we use the `esearch` function to query the PubChem Compound database (`pccompound`) for CID 11044292 within the Compound ID field (`[UID]`). The `esearch` results are then piped to `elink` to find related PubChem Compounds via the Entrez link `pccompound_pccompound`.
54 |
55 | ```console
56 |
57 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "11044292"[UID] | \
58 | > elink -target pccompound -name pccompound_pccompound | \
59 | > efetch -format docsum | \
60 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey IUPACName
61 | ...
62 | ...
63 | CC(C1=CC=CC=C1)OC(=O)C2=CC=C(C=C2)C(C)(C)C 152679150 ZNDWBMXQOLFTJJ-UHFFFAOYSA-N 1-phenylethyl 4-tert-butylbenzoate
64 | CCCCC1(CCC(C(C1)OC(=O)C2=CC=CC=C2)C(C)C)C 152242148 WDOHMQDXGOVKBZ-UHFFFAOYSA-N (5-butyl-5-methyl-2-propan-2-ylcyclohexyl) benzoate
65 | CCC1=CC=CC=C1C(=O)OC2C=CC(=O)CC2(C)C 150893175 KYLDSGZLUQKTGC-UHFFFAOYSA-N (6,6-dimethyl-4-oxocyclohex-2-en-1-yl) 2-ethylbenzoate
66 | CC1CCC(C(CC1=O)(C)C)OC(=O)C2=CC=CC=C2 150335011 GQMHTGLOWBLGSB-UHFFFAOYSA-N (2,2,5-trimethyl-4-oxocycloheptyl) benzoate
67 | ...
68 | ...
69 | ```
70 | _tested on 2021.01.27, EDirect 14.4, total count was 238._
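
With a few hundred rows, it is often easier to save the tab-delimited `xtract` output to a file with a header line than to read it in the terminal. A minimal sketch, assuming you want the element names as column headers; `add_header` is a hypothetical helper, and the demonstration row is taken from the CID list recipe above.

```shell
# add_header COL1 [COL2 ...]: hypothetical helper that prints a tab-separated
# header line built from its arguments, then copies standard input unchanged.
add_header() {
  printf '%s' "$1"
  shift
  for col in "$@"; do
    printf '\t%s' "$col"
  done
  printf '\n'
  cat
}

# Demonstration with one tab-delimited row standing in for xtract output:
printf 'C1=CC=C2C(=C1)C=C(N2)CC#N\t169731\n' | add_header IsomericSmiles CID
```

In a recipe, `add_header IsomericSmiles CID InChIKey IUPACName > similar.tsv` could be appended as the last step of the pipeline; the file name here is only an example.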
71 |
72 | ### Find Compounds with Specific Attributes
73 |
74 | There are a variety of methods to limit results and find compounds with specific attributes in PubChem Compound. The below script, for example, uses the `efilter` function to limit the `elink` similarity results to compounds with active assays using the query "pccompound_pcassay_active" in the filter (`[FILT]`) field:
75 |
76 | ```console
77 |
78 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "11044292"[UID] | \
79 | > elink -target pccompound -name pccompound_pccompound | \
80 | > efilter -query "pccompound_pcassay_active"[FILT] | \
81 | > efetch -format docsum | \
82 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey IUPACName
83 | CC1(CC(=O)C=C(C1=O)C2=CC=C(C=C2)COC(=O)C3=CC=CC=C3)C 46904830 DSGHHJWVVKGFNI-UHFFFAOYSA-N [4-(5,5-dimethyl-3,6-dioxocyclohexen-1-yl)phenyl]methyl benzoate
84 | CC1=CC[C@H](/C(=C\[C@@H](C(CCC1)(C)C)OC(=O)C)/C)OC(=O)C2=CC=CC=C2 46886858 WZEJCPHMOSKQHH-FNXKFTHESA-N [(1R,2Z,4S)-4-acetyloxy-2,5,5,9-tetramethylcycloundeca-2,9-dien-1-yl] benzoate
85 | CC1=C2[C@H]([C@@H]([C@@]3(C=CC(=O)C(=C)[C@H]3C=C2CC1=O)C)OC(=O)C)OC(=O)C4=CC=CC=C4 44585423 DSOLMHLLLLRMOJ-BNWQNPBSSA-N [(4R,5R,5aR,9aS)-5-acetyloxy-3,5a-dimethyl-9-methylidene-2,8-dioxo-1,4,5,9a-tetrahydrobenzo[g]azulen-4-yl] benzoate
86 | CCOC(=O)C1=CC=CC(=C1)C2=CC(=O)CC(C2)(C)C 44143998 BBEWYQFSZQEXCH-UHFFFAOYSA-N ethyl 3-(5,5-dimethyl-3-oxocyclohexen-1-yl)benzoate
87 | CCC1=C(C(C(OC1=O)C2=CC=CC=C2)(C)C)OC(=O)C3=CC=CC=C3 2893657 FEWXNYDYEILDTL-UHFFFAOYSA-N (5-ethyl-3,3-dimethyl-6-oxo-2-phenyl-2H-pyran-4-yl) benzoate
88 | CC1=C(C(C(OC1=O)C2=CC=CC=C2)(C)C)OC(=O)C3=CC=CC=C3 569453 UXXMZHQXFIACTH-UHFFFAOYSA-N (3,3,5-trimethyl-6-oxo-2-phenyl-2H-pyran-4-yl) benzoate
89 | ```
90 |
91 | _tested on 2021.01.27, EDirect 14.4, total count was 6._
92 |
93 | Another filtering method could be to add a specific property attribute range, such as compounds containing 8 to 12 rotatable bonds (`[RBC]`):
94 |
95 | ```console
96 |
97 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "11044292"[UID] | \
98 | > elink -target pccompound -name pccompound_pccompound | \
99 | > efilter -query "8:12"[RBC] | \
100 | > efetch -format docsum | \
101 | > xtract -pattern DocumentSummary -element IsomericSmiles CID RotatableBondCount
102 | CCC(CC)(CCCCCOC(=O)C1=CC=CC=C1)C(=O)C2=CC=CC=C2 153964625 12
103 | CCCCCCCC1CCC(CC1)OC(=O)C2=CC=CC=C2 153717993 9
104 | CCCCCCC1CCC(CC1)OC(=O)C2=CC=CC=C2 153717992 8
105 | CC(C(=O)CCCC(C)(C)CCC(C)(C)C)OC(=O)C1=CC=CC=C1 153334776 11
106 | COC(=O)CCCCC[C@H](C1=CC=CC=C1)OC(=O)C2=CC=CC=C2 145778504 11
107 | CCC(C(CC(C)(C)C)OC(=O)C1=CC=CC=C1)OC(=O)C2=CC=CC=C2 142273534 10
108 | ...
109 | ...
110 | ```
111 |
112 | _tested on 2021.01.27, EDirect 14.4, total count was 57._
113 |
114 | It is also possible to query PubChem Compound directly for compounds with specific attributes (i.e., without the use of `efilter`). However, you will likely need to be very specific in order to retrieve a reasonable number of records. For example, in the below script, PubChem Compound was queried for compounds containing Uranium in the element field (`[ELMT]`) and 3:5 defined chiral atoms in the AtomChiralDefCount field (`[ACDC]`):
115 |
116 |
117 | ```console
118 |
119 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "(\"U\"[ELMT]) AND \"3\"[ACDC]:\"5\"[ACDC]" | \
120 | > efetch -format docsum | \
121 | > xtract -pattern DocumentSummary -element IsomericSmiles CID MolecularFormula AtomChiralDefCount
122 | C[C@@H]1CC[C@H]([C@@]2([C@]1(CCC(=C2)C)C)C)[C-]=C.O.[U] 154676185 C16H27OU- 4
123 | C[CH-]O[C@H]1[C@@H]([C@H]([C@@H](O[C@@H]1[CH2-])O)O)C.[U+2] 154572041 C9H16O4U 5
124 | [CH3-].CNCCO[C@@H]1CNC[C@@H](C1C2=CC=C(C=C2)O[C@H]3CCN(C3)C4=CC(=CC=C4)F)OCC5=CC6=C(C=C5)OCCN6CC[CH2-].[U+2] 154550507 C37H49FN4O4U 3
125 | C[C@H]1CC=C(CN1C)C2=CSC(=N2)SC3=C(N4[C-]([C@H]3C)[C@H](C4=O)[C@@H](C)O)C(=O)O.[U] 154536644 C20H24N3O4S2U- 4
126 | CC1CC2C3[C@H](C=C4C[C-](CC[C@@]4(C3CC[C@@]2(C15OCCO5)C)C)OC[CH2-])O.[U+2] 154528690 C24H36O4U 3
127 | COC1=C(C=C2C(=C1)C(=O)N3CC(=C)C[C@H]3[C@@H]([N-]2)O)OCCCCCOC4=C(C=C5C(=C4)N=C[C@@H]6CC(=C)CN6C5=O)OC.[U] 153695434 C33H37N4O7U- 3
128 | ...
129 | ...
130 | ```
131 |
132 | _tested on 2021.01.27, EDirect 14.4, total count was 1938._
133 |
134 | Note that I escaped (`\`) the internal quotes in the above query. Sometimes this is not necessary (in my experience it depends on the NCBI database). If you are unsure how the query is being interpreted, run the `esearch` function with the `-debug` option. You can then use `nquire` with the link output and extract out the parsed query:
135 |
136 |
137 | ```console
138 |
139 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "(\"U\"[ELMT]) AND \"3\"[ACDC]:\"5\"[ACDC]" -debug
140 | ...
141 | ...
142 | user@computer:~$ nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ esearch.fcgi -retmax 0 -usehistory y -db pccompound -term "(\"U\"[ELMT]) AND \"3\"[ACDC]:\"5\"[ACDC]" | \
143 | > xtract -pattern eSearchResult -element QueryTranslation
144 | "U"[ELMT] AND "3"[AtomChiralDefCount] : "5"[AtomChiralDefCount]
145 | ```
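
On the shell side, the backslashes exist only to get literal double quotes into the string that `esearch` receives. Wrapping the whole query in single quotes produces the same string without escaping. The sketch below demonstrates only this shell quoting behavior (the variable names are arbitrary); whether a given NCBI database needs the inner quotes at all is a separate question, as noted above.

```shell
# Two ways to write the same Entrez query string in the shell.

# Double quotes outside: inner double quotes must be escaped with backslashes.
q_escaped="(\"U\"[ELMT]) AND \"3\"[ACDC]:\"5\"[ACDC]"

# Single quotes outside: inner double quotes are kept literally.
q_single='("U"[ELMT]) AND "3"[ACDC]:"5"[ACDC]'

# Confirm that the shell sees identical strings, then show the result.
if [ "$q_escaped" = "$q_single" ]; then
  echo "identical query strings"
fi
printf '%s\n' "$q_single"
```

Either variable could then be passed as `-query "$q_single"`; keeping the double quotes around the variable prevents the shell from splitting the query on spaces.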
146 |
147 | ### Find Number of Compounds by Create Date
148 |
149 | We can use the Create Date field `[CDAT]` in PubChem Compound to search for compound records created within a specific date range. The `esearch` results are then piped into `efetch` to retrieve the XML docsum compound records. Next, the `xtract` function is used to extract the CreateDate. The extracted data is then piped into the EDirect alias function `sort-uniq-count-rank`, which counts the unique values and ranks them by frequency. Finally, I added a `sort` command to order the output by date (`-k2,2` for the second column) instead of by number of compounds.
150 |
151 | ```console
152 |
153 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "2020/05/01"[CDAT]:"2020/05/31"[CDAT] | \
154 | > efetch -format docsum | \
155 | > xtract -pattern DocumentSummary -element CreateDate | \
156 | > sort-uniq-count-rank | \
157 | > sort -k2,2
158 | 448 2020/05/01 00:00
159 | 20 2020/05/02 00:00
160 | 45 2020/05/04 00:00
161 | 3 2020/05/05 00:00
162 | 67 2020/05/06 00:00
163 | 32 2020/05/07 00:00
164 | 97 2020/05/08 00:00
165 | 7827 2020/05/11 00:00
166 | 42 2020/05/12 00:00
167 | 573 2020/05/13 00:00
168 | 67 2020/05/14 00:00
169 | 75 2020/05/15 00:00
170 | 136 2020/05/16 00:00
171 | 69 2020/05/18 00:00
172 | 52 2020/05/19 00:00
173 | 66 2020/05/20 00:00
174 | 53 2020/05/21 00:00
175 | 790 2020/05/22 00:00
176 | 36 2020/05/23 00:00
177 | 2 2020/05/24 00:00
178 | 5 2020/05/25 00:00
179 | 530 2020/05/26 00:00
180 | 63 2020/05/27 00:00
181 | 169 2020/05/28 00:00
182 | 9432 2020/05/29 00:00
183 | 26 2020/05/30 00:00
184 | 2 2020/05/31 00:00
185 | ```
186 | _tested on 2021.01.27, EDirect 14.4._
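
To see what the last two steps of this pipeline do without contacting NCBI, here is a rough stand-in built from standard tools. The function name `count_by_value` is hypothetical, and it only approximates `sort-uniq-count-rank` (it assumes the output is a count, a tab, and the value, sorted by descending count); the sample dates are typed in by hand rather than fetched.

```shell
# Rough stand-in for sort-uniq-count-rank (an assumption, not the EDirect code):
# count identical lines, emit "count<TAB>value", highest count first.
count_by_value() {
  sort | uniq -c |
    awk '{ n = $1; sub(/^[ \t]*[0-9]+[ \t]/, ""); print n "\t" $0 }' |
    sort -t "$(printf '\t')" -k1,1nr
}

# Hand-typed sample of CreateDate values, as xtract would print one per line.
printf '%s\n' \
  '2020/05/02 00:00' \
  '2020/05/01 00:00' \
  '2020/05/02 00:00' \
  '2020/05/02 00:00' \
  '2020/05/01 00:00' \
  '2020/05/04 00:00' |
  count_by_value |
  sort -k2,2   # same final step as above: reorder by the date column
```

Without the final `sort -k2,2`, the line for 2020/05/02 (count 3) comes first; with it, the rows are in date order, as in the output above.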
187 |
188 | If we want the number of compounds in PubChem by create date over a longer period (e.g., several months to years), `efetch` is probably not practical, as the number of compounds will be in the hundreds of thousands or even millions, and downloading the docsums for that many records likely won't work. As an alternative, we can use `esearch` in a for loop and extract the Count value from the `esearch` ENTREZ_DIRECT XML summary. For example, to get the number of compounds created in PubChem for each month of 2019:
189 |
190 | ```console
191 |
192 | user@computer:~$ for date in \
193 | > "2019/01" \
194 | > "2019/02" \
195 | > "2019/03" \
196 | > "2019/04" \
197 | > "2019/05" \
198 | > "2019/06" \
199 | > "2019/07" \
200 | > "2019/08" \
201 | > "2019/09" \
202 | > "2019/10" \
203 | > "2019/11" \
204 | > "2019/12"
205 | > do
206 | > esearch -email name@xx.edu -db pccompound -query "$date[CDAT]" |
207 | > xtract -pattern ENTREZ_DIRECT -lbl "$date" -element Count
208 | > sleep 1
209 | > done
210 | 2019/01 1843612
211 | 2019/02 7970
212 | 2019/03 219313
213 | 2019/04 469125
214 | 2019/05 324068
215 | 2019/06 64691
216 | 2019/07 302326
217 | 2019/08 154938
218 | 2019/09 119817
219 | 2019/10 236148
220 | 2019/11 308727
221 | 2019/12 5444411
222 | ```
223 | _tested on 2021.01.27, total count was 12 (as expected in the for loop)._
224 |
225 | In the above for loop bash script, I added a sleep of one second between each `esearch` query in an effort to not overload the NCBI servers.
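
Typing out twelve month strings is fine for one year, but the list can also be generated. The following is a minimal sketch in POSIX shell; `months_of_year` is a hypothetical helper name, and the loop body only echoes the query it would send, so it can be run without touching the NCBI servers.

```shell
# Print YYYY/MM for every month of the given year, one per line.
months_of_year() {
  year=$1
  month=1
  while [ "$month" -le 12 ]; do
    printf '%s/%02d\n' "$year" "$month"
    month=$((month + 1))
  done
}

for date in $(months_of_year 2019); do
  # Replace this echo with the esearch | xtract lines from the loop above,
  # and keep the "sleep 1" between queries.
  echo "would query: ${date}[CDAT]"
done
```

The braces in `${date}[CDAT]` make the end of the variable name explicit; the resulting query string is the same as in the loop above.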
226 |
227 | ### Find Related PubChem Substances (same)
228 |
229 | To find the number of related PubChem substances for a PubChem compound, we can use `elink` with Entrez link `pccompound_pcsubstance_same`:
230 |
231 | ```console
232 |
233 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "14333"[UID] | \
234 | > elink -target pcsubstance -name pccompound_pcsubstance_same | \
235 | > xtract -pattern ENTREZ_DIRECT -element Count
236 | 51
237 | ```
238 |
239 | And then to retrieve information about the PubChem substances, we can pipe these results into `efetch` and `xtract`, to extract out specific information such as the SID, CurrentSourceName, SourceID, and DepositDate:
240 |
241 | ```console
242 |
243 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "14333"[UID] | \
244 | > elink -target pcsubstance -name pccompound_pcsubstance_same | \
245 | > efetch -format docsum | \
246 | > xtract -pattern DocumentSummary -element SID CurrentSourceName SourceID DepositDate
247 | ...
248 | 439452693 THE BioTek bt-308998 2020/12/31 00:00
249 | 438657242 Alfa Chemistry ACM1132394 2020/12/09 00:00
250 | 438538915 3WAY PHARM INC SWOT-0105728 2020/12/08 00:00
251 | 435642079 Chem-Space.com Database CSSB00032005459 2020/11/21 00:00
252 | 410573132 Google Patents 15237363 2020/08/12 00:00
253 | 404911410 The University of Alabama Libraries UALIB-1927 2020/03/21 00:00
254 | 403383863 PATENTSCOPE (WIPO) ORQWTLCYLDRDHK-UHFFFAOYSA-N 2020/01/24 00:00
255 | 387135315 NORMAN Suspect List Exchange ORQWTLCYLDRDHK-UHFFFAOYSA-N 2019/11/22 00:00
256 | 386279116 Wiley 140582 2019/10/23 00:00
257 | ...
258 | ...
259 | ```
260 | _tested on 2021.01.27, total count was 51._
261 |
262 |
--------------------------------------------------------------------------------
/05_EDirect_PubMed_Recipes.md:
--------------------------------------------------------------------------------
1 | # PubMed EDirect Recipes
2 |
3 | **Notes**
4 |
5 | > 1. `user@computer:~$` represents an example terminal prompt name. Actual command/argument input is after the `$`.
6 | > 2. Replace `name@xx.edu` with your email address.
7 | > 3. `\` followed by `>` on the next line represents continued terminal input. You will need to delete the `>` symbol in order to run the scripts as a copy/paste into terminal.
8 | > 4. You should validate your own EDirect scripts and results as there may be unintentional mistakes in these recipes. A convenient method is to compare your EDirect results to the NCBI Web interface search results: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/).
9 |
10 | ## PubMed EDirect
11 |
12 | ### Search PubMed by Keyword and/or MeSH and Retrieve References
13 |
14 | We can use the EDirect function `esearch` to query PubMed. However, before trying to retrieve any of the results with `efetch`, it is a good idea to check that the count range is manageable (e.g., on the order of several thousand). In addition, see the [EDirect Query Translation Instructions](https://github.com/vfscalfani/EDirectChemInfo/blob/master/01_EDirect_Intro.md#edirect-query-translation-via-debug-flag) for how to use the `-debug` option to view how your query is interpreted in PubMed.
15 |
16 | ```console
17 |
18 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "hydrogel-based drug delivery"
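19 | <ENTREZ_DIRECT>
20 |   <Db>pubmed</Db>
21 |   <WebEnv>MCID...</WebEnv>
22 |   <QueryKey>1</QueryKey>
23 |   <Count>436</Count>
24 |   <Step>1</Step>
25 |   <Email>name@xx.edu</Email>
26 | </ENTREZ_DIRECT>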
27 | ```
28 |
29 | After deciding if the `esearch` query is appropriate, we can start to pipe the `esearch` results into other EDirect functions. For example, the script below first uses `esearch` to query PubMed for "hydrogel-based drug delivery", and these results are then piped (`|`) into `efetch` to retrieve the records in XML format. The `efetch` results are then piped to the `xtract` function, where several bibliographic elements of the PubMed XML records are extracted into a table:
30 |
31 | ```console
32 |
33 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "hydrogel-based drug delivery" | \
34 | > efetch -format xml | \
35 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
36 | > Author/Initials ArticleTitle ISOAbbreviation PubDate/Year Volume Issue MedlinePgn
37 | 33424262 El-Masry SM Hydrogel-based matrices for controlled drug delivery of etamsylate: Prediction of in-vivo plasma profiles. Saudi Pharm J 2020 28 12 1704-1718
38 | 33398321 Chen W Magnetically actuated intelligent hydrogel-based child-parent microrobots for targeted drug delivery. J Mater Chem B 2021
39 | 33396629 Dehshahri A New Horizons in Hydrogels for Methotrexate Delivery. Gels 2020 7 1
40 | 33387892 Amiri M Hydrogel beads-based nanocomposites in novel drug delivery platforms: Recent trends and developments. Adv Colloid Interface Sci 2020 288 102316
41 | 33378390 Kloepping KC Triphenylphosphonium derivatives disrupt metabolism and inhibit melanoma growth in vivo when delivered via a thermosensitive hydrogel. PLoS One 2020 15 12 e0244540
42 | 33359482 Agarwal P Structural characterization and developability assessment of sustained release hydrogels for rapid implementation during preclinical studies. Eur J Pharm Sci 2021 158 105689
43 | ...
44 | ...
45 | ```
46 |
47 | _tested on 2021.01.27, EDirect 14.4, total count was 436._
48 |
49 |
50 | Note that if we want to extract out the DOIs, we can use the `xtract` `-block` option like this:
51 |
52 | ```console
53 |
54 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "hydrogel-based drug delivery" | \
55 | > efetch -format xml | \
56 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
57 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn \
58 | > -block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId
59 | 33424262 El-Masry SM Saudi Pharm J 2020 28 12 1704-1718 https://doi.org/10.1016%2Fj.jsps.2020.10.016
60 | 33398321 Chen W J Mater Chem B 2021 https://doi.org/10.1039%2Fd0tb02384a
61 | 33396629 Dehshahri A Gels 2020 7 1 https://doi.org/10.3390%2Fgels7010002
62 | 33387892 Amiri M Adv Colloid Interface Sci 2020 288 102316 https://doi.org/10.1016%2Fj.cis.2020.102316
63 | 33378390 Kloepping KC PLoS One 2020 15 12 e0244540 https://doi.org/10.1371%2Fjournal.pone.0244540
64 | 33359482 Agarwal P Eur J Pharm Sci 2021 158 105689 https://doi.org/10.1016%2Fj.ejps.2020.105689
65 | ...
66 | ...
67 | ```
68 | _tested on 2021.01.27, EDirect 14.4, total count was 436._
69 |
70 |
71 | There is a lot going on with the last line of code that extracts out the DOIs: `-block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId`. Let's look at part of a PubMed XML file to help interpret what is going on here:
72 |
73 | ```console
74 | ...
75 | ...
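76 | <ArticleIdList>
77 |   <ArticleId IdType="pubmed">17630804</ArticleId>
78 |   <ArticleId IdType="doi">10.1021/jo071035l</ArticleId>
79 | </ArticleIdList>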
80 | ...
81 | ...
82 | ```
83 |
84 | The `-block` option limits the extraction to a particular section of the XML, in this case the `ArticleId` tags. The `@` refers to an attribute of the element, here `IdType`, and the `-if ... -equals doi` condition keeps only the `ArticleId` whose `IdType` attribute is `doi`. Finally, `-doi` is an `xtract` string option that prefixes `https://doi.org/` to the extracted ArticleId DOI. There is a more thorough explanation of `-block` and extracting out the DOIs with the `-block` option in the [NLM Insider's Guide to Accessing NLM Data Part 4](https://dataguide.nlm.nih.gov/classes/edirect-for-pubmed/samplecode4.html#output-a-list-of-pmids-and-corresponding-dois) and [Entrez Programming Utilities Help Manual](https://www.ncbi.nlm.nih.gov/books/NBK179288/).
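
In the sample output above, the DOI part of each link is percent-encoded (for example `%2F` in place of `/`). If plain DOIs are preferred for a spreadsheet or reference manager, a small post-processing step can undo that. The sketch below is an assumption-laden convenience, not part of EDirect: `decode_doi_url` is a hypothetical name, and it only handles the three escapes that appear in this file's sample outputs (`%2F`, `%28`, `%29`) rather than general percent-decoding.

```shell
# Decode the percent-escapes seen in this file's -doi output.
# Only %2F, %28 and %29 are handled; this is not a general URL decoder.
decode_doi_url() {
  sed -e 's|%2F|/|g' -e 's|%28|(|g' -e 's|%29|)|g'
}

# Example with one link copied from the output above.
printf '%s\n' 'https://doi.org/10.1016%2Fj.jsps.2020.10.016' | decode_doi_url
```

In practice the function would be appended to the pipeline after `xtract`, for example `... | xtract ... | decode_doi_url`.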
85 |
86 |
87 | Similarly to the above script, we can specify particular fields to query within PubMed. The below script searches for "ionic liquids" in the MeSH term field (`[MESH]`) and "imidazolium" in all fields. Note that the internal quotes are escaped (`\`), which is sometimes necessary for the query to be interpreted correctly when using phrases.
88 |
89 |
90 | ```console
91 |
92 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "\"ionic liquids\"[MESH] AND imidazolium" | \
93 | > efetch -format xml | \
94 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
95 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn \
96 | > -block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId
97 | 33396149 Hu LX Ecotoxicol Environ Saf 2021 208 111629 https://doi.org/10.1016%2Fj.ecoenv.2020.111629
98 | 33346267 Kaur M Phys Chem Chem Phys 2021 23 1 320-328 https://doi.org/10.1039%2Fd0cp04513f
99 | 33253998 Tashakkori P J Chromatogr A 2021 1635 461741 https://doi.org/10.1016%2Fj.chroma.2020.461741
100 | 33142384 Ren YM Zhonghua Lao Dong Wei Sheng Zhi Ye Bing Za Zhi 2020 38 10 767-769 https://doi.org/10.3760%2Fcma.j.cn121094-20191010-00483
101 | 33135708 Kumar S Phys Chem Chem Phys 2020 22 43 25255-25263 https://doi.org/10.1039%2Fd0cp04014b
102 | 32822985 Zuo L J Chromatogr A 2020 1628 461446 https://doi.org/10.1016%2Fj.chroma.2020.461446
103 | 32711338 Zunita M Bioresour Technol 2020 315 123864 https://doi.org/10.1016%2Fj.biortech.2020.123864
104 | ...
105 | ...
106 | ```
107 | _tested on 2021.01.27, EDirect 14.4, total count was 1000._
108 |
109 |
110 | ### Calculate the Most Frequent Journal Titles For a PubMed Search
111 |
112 | The script below uses `esearch` to query PubMed for "Artificial Intelligence" in the `[MESH]` field and "drug discovery" in the `[ALL]` field. The records are then retrieved in XML format using the `efetch` function, followed by extracting out the journal names (`ISOAbbreviation`) using `xtract`. The `xtract` results are then piped to the EDirect alias function `sort-uniq-count-rank`, which sorts the data by highest frequency:
113 |
114 | ```console
115 |
116 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "\"Artificial Intelligence\"[MESH] AND \"drug discovery\"[ALL]" | \
117 | > efetch -format xml | \
118 | > xtract -pattern PubmedArticle -element ISOAbbreviation | \
119 | > sort-uniq-count-rank
120 | 169 J Chem Inf Model
121 | 53 BMC Bioinformatics
122 | 49 PLoS One
123 | 40 Bioinformatics
124 | 39 Methods Mol Biol
125 | 33 Mol Pharm
126 | 32 Molecules
127 | 29 Sci Rep
128 | 28 Drug Discov Today
129 | 28 J Comput Aided Mol Des
130 | 24 J Med Chem
131 | 23 Expert Opin Drug Discov
132 | 23 Int J Mol Sci
133 | 19 Curr Top Med Chem
134 | 18 Mol Inform
135 | 17 Future Med Chem
136 | 16 Nucleic Acids Res
137 | 15 Nature
138 | 15 PLoS Comput Biol
139 | 14 IEEE/ACM Trans Comput Biol Bioinform
140 | ...
141 | ...
142 | ```
143 | _tested on 2021.01.27, EDirect 14.4._
144 |
145 | ### Calculate the Frequency of Author Publications for a University Department in PubMed
146 |
147 | The below script uses `esearch` to query PubMed for ("university of alabama" AND tuscaloosa) in the affiliation field (`[AFFL]`). Tuscaloosa was added to limit the number of retrieved records associated with The University of Alabama at Birmingham and The University of Alabama at Huntsville. Another approach could have been to use the NOT operator: `"(university of alabama[AFFL]) NOT (birmingham[AFFL] OR huntsville[AFFL])"`. However, the latter approach may eliminate any collaborative articles with these institutions (affiliation searches are challenging!). Next, the results were retrieved as XML using `efetch`, followed by piping these results to `xtract` to extract out the publication year (`PubDate/Year`) and sort by frequency with `sort-uniq-count-rank`. Note that a conditional statement was used in the `xtract` pattern to only extract results from articles if the affiliation contains both `chemistry` and `tuscaloosa`. The thought here was that this would limit the results (mostly) to author publications from The University of Alabama (Tuscaloosa) Department of Chemistry:
148 |
149 | ```console
150 |
151 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "(university of alabama[AFFL] AND tuscaloosa[AFFL])" | \
152 | > efetch -format xml | \
153 | > xtract -pattern PubmedArticle -if Affiliation -contains chemistry -and Affiliation -contains tuscaloosa -element PubDate/Year | \
154 | > sort-uniq-count-rank
155 | 65 2015
156 | 64 2020
157 | 59 2017
158 | 53 2018
159 | 49 2019
160 | 45 2016
161 | 41 2014
162 | 35 2013
163 | 30 2012
164 | 28 2008
165 | 26 2006
166 | 23 2007
167 | 23 2010
168 | 20 2004
169 | 19 2003
170 | 18 2009
171 | 17 1999
172 | 17 2001
173 | ...
174 | ...
175 | ```
176 |
177 | _tested on 2021.01.27, EDirect 14.4._
178 |
179 | If we want publication counts for individual authors in PubMed instead of total publications by year, we can change the `xtract` pattern:
180 |
181 | ```console
182 |
183 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "(university of alabama[AFFL] AND tuscaloosa[AFFL])" | \
184 | > efetch -format xml | \
185 | > xtract -pattern Author -if Affiliation -contains chemistry -and Affiliation -contains tuscaloosa -element LastName Initials | \
186 | > sort-uniq-count-rank
187 | 108 Dixon DA
188 | 53 Vasiliu M
189 | 34 Vincent JB
190 | 32 Rogers RD
191 | 22 Bowman MK
192 | 20 Fang Z
193 | 14 Grant DJ
194 | 14 Thanthiriwatte KS
195 | 13 Cassady CJ
196 | 12 Frantom PA
197 | 11 Chen M
198 | 11 Kelley SP
199 | 11 Shamshina JL
200 | 10 Kispert LD
201 | 10 Metzger RM
202 | 10 Papish ET
203 | 9 Gerlach DL
204 | 9 Li S
205 | 9 Timkovich R
206 | 8 Matus MH
207 | ...
208 | ...
209 | ```
210 | _tested on 2021.01.27, EDirect 14.4._
211 |
212 |
213 | Let's take a closer look at the conditional `xtract` pattern specifying to extract data only if the affiliation contains chemistry and tuscaloosa:
214 |
215 | ```console
216 |
217 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "(university of alabama[AFFL] AND tuscaloosa[AFFL])" | \
218 | > efetch -format xml | \
219 | > xtract -pattern Author -if Affiliation -contains chemistry -and Affiliation -contains tuscaloosa -element LastName Initials Affiliation
220 | Rowe SJ Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA.
221 | Mecaskey RJ Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA.
222 | Nasef M Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA.
223 | Talton RC Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA.
224 | Sharkey RE Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA.
225 | Halliday JC Department of Chemistry and Biochemistry, University of Alabama, Tuscaloosa, Alabama, USA.
226 | ...
227 | ...
228 | ```
229 | _tested on 2021.01.27, EDirect 14.4._
230 |
231 |
232 | With a quick look at the ~1000 results, it seemed like we extracted the intended data; however, I did notice some false positive results. One example was an article from The University of Alabama (Tuscaloosa) Department of Biological Sciences with an external collaborator having "Chemistry" in the institution name. Other errors could come from what we unintentionally excluded, such as any records that do not have Tuscaloosa in the affiliation field (e.g., only a partial address or zip code). These types of affiliation searches are tricky, so test often and think through the results carefully.
233 |
234 |
235 | ### Retrieve Cites and Cited References in PubMed
236 |
237 | The `elink` function can retrieve associated cites and cited references for PubMed records. Cites are the references listed in the article (i.e., its bibliography), and cited are other articles that reference the article. Not all PubMed articles have associated citation reference data. The available reference data are from the [NIH Open Citation Collection Dataset](https://pubmed.ncbi.nlm.nih.gov/31600197/).
238 |
239 | To retrieve the number of cites for a PubMed article, we can use the `elink` function, followed by `xtract` to extract out the Count element:
240 |
241 | ```console
242 |
243 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "29978703[PMID]" | \
244 | > elink -cites | \
245 | > xtract -pattern ENTREZ_DIRECT -element Count
246 | 11
247 | ```
248 | _tested on 2021.01.27, EDirect 14.4._
249 |
250 |
251 | Add `efetch` to your script if you want to retrieve the records:
252 |
253 | ```console
254 |
255 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "29978703[PMID]" | \
256 | > elink -cites | \
257 | > efetch -format xml | \
258 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
259 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn \
260 | > -block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId
261 | 29382051 Stefanachi A Molecules 2018 23 2 https://doi.org/10.3390%2Fmolecules23020250
262 | 27709885 Stempel E Acc Chem Res 2016 49 11 2390-2402 https://doi.org/10.1021%2Facs.accounts.6b00265
263 | 26661053 James MJ Chemistry 2016 22 9 2856-81 https://doi.org/10.1002%2Fchem.201503835
264 | 26313158 Liu BY Org Lett 2015 17 17 4380-3 https://doi.org/10.1021%2Facs.orglett.5b02230
265 | 22969063 Han X Angew Chem Int Ed Engl 2012 51 41 10390-3 https://doi.org/10.1002%2Fanie.201205238
266 | 18620434 Martin R Acc Chem Res 2008 41 11 1461-73 https://doi.org/10.1021%2Far800036s
267 | ...
268 | ...
269 | ```
270 |
271 | _tested on 2021.01.27, EDirect 14.4._
272 |
273 | Getting the cited records only requires changing `-cites` to `-cited`:
274 |
275 | ```console
276 |
277 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "29978703[PMID]" | \
278 | > elink -cited | \
279 | > efetch -format xml | \
280 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
281 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn \
282 | > -block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId
283 | 32537619 Fernandes RA Chem Commun (Camb) 2020 56 61 8569-8590 https://doi.org/10.1039%2Fd0cc02659j
284 | 32317969 Lautié E Front Pharmacol 2020 11 397 https://doi.org/10.3389%2Ffphar.2020.00397
285 | 30707497 Ivanova OA Chem Rec 2019 https://doi.org/10.1002%2Ftcr.201800166
286 | 30259622 Tymann D Angew Chem Int Ed Engl 2018 57 47 15553-15557 https://doi.org/10.1002%2Fanie.201808578
287 | ```
288 | _tested on 2021.01.27, EDirect 14.4._
289 |
290 | We can answer some interesting questions with the NIH Open Citation Collection Data. For example, I noticed that the PubMed XML records for articles in *J Cheminform* contain a reference list for articles in PubMed. So, theoretically, if we query PubMed for *J Cheminform*, extract out all of the references, and sort these by frequency, we should get the most cited references in *J Cheminform* article bibliographies (caveat: in the available PubMed citation data).
291 |
292 | In the below script, the `xtract` pattern creates a new line for each extracted reference citation PMID from the ArticleId with pubmed attribute field:
293 |
294 | ```console
295 |
296 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Cheminform[JOUR]" | \
297 | > efetch -format xml | \
298 | > xtract -pattern Reference -if ArticleId@IdType -equals pubmed -element ArticleId | \
299 | > sort-uniq-count-rank | \
300 | > head -n 20
301 | 90 20426451
302 | 83 21982300
303 | 80 21948594
304 | 65 12653513
305 | 47 11259830
306 | 40 10592235
307 | 38 16796559
308 | 38 27899562
309 | 37 26400175
310 | 36 8709122
311 | 33 17154509
312 | 33 19498078
313 | 30 16381955
314 | 29 21059682
315 | 29 23343401
316 | 28 15667143
317 | 28 24214965
318 | 27 17932057
319 | 27 21425294
320 | 27 22587354
321 | ```
322 | _tested on 2021.01.27, EDirect 14.4._
323 |
324 | Note that when quickly viewing all of the sorted results (~10,000 lines), I did see roughly 100 entries with two PMIDs per line or a DOI and a PMID. Since we specifically required the pubmed IdType attribute, it is not yet clear to me why there would be extra data in there. Perhaps it is a mistake or inconsistency in the *J Cheminform* PubMed XML records.
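
One way to set those odd entries aside before ranking is to keep only lines that hold a count followed by exactly one numeric PMID. This is a sketch under assumptions: it supposes the ranked output is tab-separated and that the extra identifiers show up as additional tab-separated columns, and `keep_single_pmid` is a hypothetical helper name. The two malformed sample lines are invented to mimic what is described above (the pair of PMIDs on the second line is made up).

```shell
# Keep only "count<TAB>PMID" lines where the PMID column is purely numeric.
keep_single_pmid() {
  awk -F '\t' 'NF == 2 && $2 ~ /^[0-9]+$/'
}

# Two clean lines from the output above, plus two invented malformed lines.
printf '90\t20426451\n3\t12345678\t23456789\n2\t10.1186/1758-2946-3-33\t21982300\n83\t21982300\n' |
  keep_single_pmid
```

Only the first and last sample lines pass the filter. Note that filtering after counting drops the malformed lines entirely; whether their PMIDs should instead be split out and counted is a judgment call for a more careful analysis.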
325 |
326 |
327 | We can take a quick look at the top 10 cited references using a for loop:
328 |
329 | ```console
330 | user@computer:~$ for refs in \
331 | > "20426451" \
332 | > "21982300" \
333 | > "21948594" \
334 | > "12653513" \
335 | > "11259830" \
336 | > "10592235" \
337 | > "16796559" \
338 | > "27899562" \
339 | > "26400175" \
340 | > "8709122"
341 | > do
342 | > esearch -email name@xx.edu -db pubmed -query "$refs[PMID]" |
343 | > efetch -format xml |
344 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
345 | > Author/Initials ArticleTitle ISOAbbreviation PubDate/Year Volume Issue MedlinePgn \
346 | > -block ArticleId -if ArticleId@IdType -equals doi -doi ArticleId
347 | > sleep 1
348 | > done
349 | 20426451 Rogers D Extended-connectivity fingerprints. J Chem Inf Model 2010 50 5 742-54 https://doi.org/10.1021%2Fci100050t
350 | 21982300 O'Boyle NM Open Babel: An open chemical toolbox. J Cheminform 2011 3 33 https://doi.org/10.1186%2F1758-2946-3-33
351 | 21948594 Gaulton A ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 2012 40 Database issue D1100-7 https://doi.org/10.1093%2Fnar%2Fgkr777
352 | 12653513 Steinbeck C The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics. J Chem Inf Comput Sci 43 2 493-500 https://doi.org/10.1021%2Fci025584y
353 | 11259830 Lipinski CA Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 2001 46 1-3 3-26 https://doi.org/10.1016%2Fs0169-409x%2800%2900129-0
354 | 10592235 Berman HM The Protein Data Bank. Nucleic Acids Res 2000 28 1 235-42 https://doi.org/10.1093%2Fnar%2F28.1.235
355 | 16796559 Steinbeck C Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics. Curr Pharm Des 2006 12 17 2111-20 https://doi.org/10.2174%2F138161206777585274
356 | 27899562 Gaulton A The ChEMBL database in 2017. Nucleic Acids Res 2017 45 D1 D945-D954 https://doi.org/10.1093%2Fnar%2Fgkw1074
357 | 26400175 Kim S PubChem Substance and Compound databases. Nucleic Acids Res 2016 44 D1 D1202-13 https://doi.org/10.1093%2Fnar%2Fgkv951
358 | 8709122 Bemis GW The properties of known drugs. 1. Molecular frameworks. J Med Chem 1996 39 15 2887-93 https://doi.org/10.1021%2Fjm9602928
359 | ```
360 |
361 | Another interesting question is which journal is cited most often in *J Cheminform* articles (in the available PubMed citation data). In the script below, we take a similar approach to the one above, but instead of extracting the PMIDs, we extract the Citation element. The line `cut -d "." -f 1` deletes any data after the journal abbreviation (e.g., "Drug Discov Today. 2006 Dec;11(23-24):1046-53" becomes "Drug Discov Today").
362 | ```console
363 |
364 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Cheminform[JOUR]" | \
365 | > efetch -format xml | \
366 | > xtract -pattern Reference -element Citation | \
367 | > cut -d "." -f 1 | \
368 | > sort-uniq-count-rank
369 | 2274 J Chem Inf Model
370 | 1268 J Cheminform
371 | 1167 Nucleic Acids Res
372 | 930 J Med Chem
373 | 620 Bioinformatics
374 | 576 J Chem Inf Comput Sci
375 | 425 J Comput Aided Mol Des
376 | 381 BMC Bioinformatics
377 | 351 Drug Discov Today
378 | 252 Mol Inform
379 | 247 J Comput Chem
380 | 236 PLoS One
381 | 222 Nature
382 | 215 Nat Rev Drug Discov
383 | 202 Proc Natl Acad Sci U S A
384 | 181 Science
385 | 148 Anal Chem
386 | 147 Proteins
387 | 140 J Mol Graph Model
388 | ...
389 | ...
390 | ```
391 | _tested on 2021.01.27, EDirect 14.4._
392 |
393 | Note that there is some inconsistency in the citation formats here as well, which would need to be evaluated and cleaned up for a more thorough analysis. For example, some of the extracted Citations include author names and article titles, so the `cut` command deleting everything after the first `.` does not suffice for those entries.
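
The limitation is easy to reproduce offline. In the sketch below, `journal_prefix` is just a hypothetical wrapper around the same `cut` call; the first sample line is the citation quoted earlier in this section, and the second is an invented citation that starts with an author name, to mimic the inconsistent entries.

```shell
# Same step as in the pipeline above: keep the text before the first period.
journal_prefix() {
  cut -d "." -f 1
}

printf '%s\n' \
  'Drug Discov Today. 2006 Dec;11(23-24):1046-53' \
  'Smith J. An example article title. J Chem Inf Model. 2010;50:1-2' |
  journal_prefix
```

The first line is reduced to `Drug Discov Today` as intended, while the second is reduced to `Smith J`, which would then be counted as if it were a journal.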
394 |
395 |
396 | ### Number of Records in PubMed by Create Date
397 |
398 | Here is an interesting script to retrieve the count of PubMed records by create date (`[CRDT]`) for each month of 2020. Since there are over 100,000 records added to PubMed every month, a strategy using `efetch` likely would not work (i.e., trying to retrieve 500,000+ records would take a long time).
399 |
400 | ```console
401 |
402 | user@computer:~$ for date in \
403 | > "2020/01" \
404 | > "2020/02" \
405 | > "2020/03" \
406 | > "2020/04" \
407 | > "2020/05" \
408 | > "2020/06"
409 | > do
410 | > esearch -email name@xx.edu -db pubmed -query "$date[CRDT]" |
411 | > xtract -pattern ENTREZ_DIRECT -lbl "$date" -element Count
412 | > sleep 1
413 | > done
414 | 2020/01 108863
415 | 2020/02 107561
416 | 2020/03 106386
417 | 2020/04 124575
418 | 2020/05 121324
419 | 2020/06 124664
420 | ```
421 | _tested on 2021.01.27, EDirect 14.4._
422 |
423 |
424 |
425 | ### Number of Records in PubMed that are Also Freely Available in PubMed Central
426 |
427 | Let's say we wanted to know how many articles in *J Chem Inf Model* (indexed in PubMed) are available freely in PubMed Central. We can first get a count for *J Chem Inf Model* records in PubMed by querying PubMed in the Journal field (`[JOUR]`), followed by retrieving the records, extracting out the PubDate, and then sorting by frequency:
428 |
429 | ```console
430 |
431 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Chem Inf Model[JOUR]" | \
432 | > efetch -format docsum | \
433 | > xtract -pattern DocumentSummary -element PubDate | \
434 | > cut -d " " -f 1 | \
435 | > sort-uniq-count-rank | \
436 | > sort -k2,2
437 | 216 2005
438 | 280 2006
439 | 246 2007
440 | 225 2008
441 | 268 2009
442 | 203 2010
443 | 297 2011
444 | 306 2012
445 | 300 2013
446 | 309 2014
447 | 247 2015
448 | 232 2016
449 | 283 2017
450 | 237 2018
451 | 490 2019
452 | 612 2020
453 | 65 2021
454 | ```
455 | _tested on 2021.01.27, EDirect 14.4._
456 |
457 |
458 | In the above script, the line `cut -d " " -f 1` deletes any data appearing after the year and `sort -k2,2` sorts the data by the second column. Next, we can add `elink` into our script to find the linked records in PubMed Central (`pmc`) from the Entrez link `pubmed_pmc`:
459 |
460 | ```console
461 |
462 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "J Chem Inf Model[JOUR]" | \
463 | > elink -target pmc -name pubmed_pmc | \
464 | > efetch -format docsum | \
465 | > xtract -pattern DocumentSummary -element PubDate | \
466 | > cut -d " " -f 1 | \
467 | > sort-uniq-count-rank | \
468 | > sort -k2,2
469 | 1 2005
470 | 3 2006
471 | 7 2007
472 | 14 2008
473 | 31 2009
474 | 26 2010
475 | 62 2011
476 | 38 2012
477 | 55 2013
478 | 59 2014
479 | 38 2015
480 | 32 2016
481 | 39 2017
482 | 43 2018
483 | 57 2019
484 | 51 2020
485 | ```
486 |
487 | _tested on 2021.01.27, EDirect 14.4._
488 |
489 | Note that if you have a query returning tens of thousands of results, you would likely want to use a strategy without `efetch`, such as adding a date into your `esearch` query, followed by extracting out the count element from the XML.
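
To answer the original question as a proportion, the two count tables can be combined by year. The following is a minimal sketch, assuming each table has been saved as "count, whitespace, year" lines and that the year columns of the PubMed and PMC docsums are comparable; `pmc_share` and the file names are hypothetical. The sample rows are three years copied from the two outputs above.

```shell
# $1: PubMed counts file, $2: PMC counts file (both "count year" per line).
# Prints: year, PMC count, PubMed count, PMC share in percent.
pmc_share() {
  awk 'NR == FNR { total[$2] = $1; next }
       ($2 in total) && total[$2] > 0 {
         printf "%s\t%d\t%d\t%.1f\n", $2, $1, total[$2], 100 * $1 / total[$2]
       }' "$1" "$2"
}

# Three years copied from the outputs above, written to temporary files.
tmpdir=$(mktemp -d)
printf '237\t2018\n490\t2019\n612\t2020\n' > "$tmpdir/pubmed.tsv"
printf '43\t2018\n57\t2019\n51\t2020\n' > "$tmpdir/pmc.tsv"
pmc_share "$tmpdir/pubmed.tsv" "$tmpdir/pmc.tsv"
rm -r "$tmpdir"
```

For these rows the share works out to 18.1%, 11.6%, and 8.3% for 2018, 2019, and 2020. Years that are missing from the PubMed table are skipped rather than reported.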
490 |
491 |
492 |
--------------------------------------------------------------------------------
/03_EDirect_PubChem_BioAssay_PubMed_Recipes.md:
--------------------------------------------------------------------------------
1 | # PubChem <--> PubChem BioAssay <--> PubMed EDirect Recipes
2 |
3 | **Notes**
4 |
5 | > 1. `user@computer:~$` represents an example terminal prompt name. Actual command/argument input is after the `$`.
6 | > 2. Replace `name@xx.edu` with your email address.
7 | > 3. `\` followed by `>` on the next line represents continued terminal input. You will need to delete the `>` symbol in order to run the scripts as a copy/paste into terminal.
8 | > 4. You should validate your own EDirect scripts and results as there may be unintentional mistakes in these recipes. A convenient method is to compare your EDirect results to the NCBI Web interface search results: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/).
9 |
10 | ## EDirect PubChem Entrez Links
11 |
12 | ### PubChem Compound --> PubMed Citations
13 | **Description:** Search for a CID in the PubChem Compound Database and retrieve related PubMed linked references.
14 |
15 | In the below script, we use the `esearch` function to query the PubChem Compound database (`pccompound`) for CID 174076 within the Compound ID field, `[uid]`. The `esearch` results are then piped to `elink` finding related PubMed citations via the Entrez link `pccompound_pubmed`. Finally, we retrieve the results with `efetch` in XML format and extract out some bibliographic reference information using the `xtract` function.
16 |
17 | ```console
18 |
19 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 174076[uid] | \
20 | > elink -target pubmed -name pccompound_pubmed | \
21 | > efetch -format xml | \
22 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
23 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn
24 | 22957575 Gabl S J Chem Phys 2012 137 9 094501
25 | 22868451 Zhang Y Phys Chem Chem Phys 2012 14 35 12157-64
26 | 22859056 Malberg F Phys Chem Chem Phys 2012 14 35 12079-82
27 | 22852554 Zhang Y J Phys Chem B 2012 116 33 10036-48
28 | 22662183 Zhang BB PLoS ONE 2012 7 5 e37641
29 | ...
30 | ```
31 | _tested on 2021.01.26, EDirect 14.4, total count was 102._
32 |
33 | ### PubChem Compound --> PubMed Citations (with filtering)
34 | **Description:** Search for CID in PubChem Compound Database, find related PubMed citations, then only retrieve references from a specific journal.
35 |
36 | We can filter `elink` results with `efilter` so that only PubMed citations that are linked to the CID (via `pccompound_pubmed`) and also match a specific PubMed query are kept. For example, if we are only interested in _Phys Chem Chem Phys_ references linked to CID 174076, we can use the journal field `[JOUR]` in an `efilter` query:
37 |
38 | ```console
39 |
40 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 174076[uid] | \
41 | > elink -target pubmed -name pccompound_pubmed | \
42 | > efilter -query "Phys Chem Chem Phys"[JOUR] | \
43 | > efetch -format xml | \
44 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
45 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn
46 | 22868451 Zhang Y Phys Chem Chem Phys 2012 14 35 12157-64
47 | 22859056 Malberg F Phys Chem Chem Phys 2012 14 35 12079-82
48 | 22451012 Sillars FB Phys Chem Chem Phys 2012 14 17 6094-100
49 | 21643581 Pensado AS Phys Chem Chem Phys 2011 13 30 13518-26
50 | 21643580 Schröder C Phys Chem Chem Phys 2011 13 26 12240-8
51 | ...
52 | ```
53 | _tested on 2021.01.26, EDirect 14.4, total count was 11._
54 |
55 | ### PubChem Compound --> PubMed MeSH (with filtering)
56 | **Description:** Search for a CID in PubChem Compound, find related PubMed records via MeSH, and retrieve only references that contain the MeSH subheading "chemical synthesis".
57 |
58 | This is my favorite literature search: start with a PubChem CID and then find PubMed literature related to its synthesis. As in the search above, we can restrict the references with an `efilter` query, here for 'chemical synthesis' as a MeSH subheading `[SUBH]`. Note that we used the `pccompound_pubmed_mesh` Entrez link as the `elink` link name (`-name`) here.
59 |
60 | ```console
61 |
62 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 94257[uid] | \
63 | > elink -target pubmed -name pccompound_pubmed_mesh | \
64 | > efilter -query "chemical synthesis"[SUBH] | \
65 | > efetch -format xml | \
66 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID ArticleTitle \
67 | > ISOAbbreviation PubDate/Year
68 | 28463562 Enantioselective Chemical Syntheses of the Furanosteroids (-)-Viridin and (-)-Viridiol. J. Am. Chem. Soc. 2017
69 | 23040731 Viridin analogs derived from steroidal building blocks. Bioorg. Med. Chem. Lett. 2012
70 | 22849426 Synthetic studies on furanosteroids: construction of the viridin core structure via Diels-Alder/retro-Diels-Alder and vinylogous Mukaiyama aldol-type reaction. J. Org. Chem. 2012
71 | 19644878 Abrogation of antibody-induced arthritis in mice by a self-activating viridin prodrug and association with impaired neutrophil and endothelial cell function. Arthritis Rheum. 2009
72 | 19572524 Pentacyclic furanosteroids: the synthesis of potential kinase inhibitors related to viridin and wortmannolone. J. Org. Chem. 2009
73 | ...
74 | ```
75 | _tested on 2021.01.26, EDirect 14.4, total count was 8._
76 |
77 |
78 | ### PubChem Compound --> PubMed Citations OR PubMed MeSH
79 | **Description:** Search for a CID in PubChem Compound, find related PubMed citations and related PubMed citations via MeSH.
80 |
81 | It appears that you can combine `elink` queries, with either the same Entrez link or a different Entrez link, but within the same database. For example, if we want to retrieve PubMed literature related to PubChem CID 174076 for both the `pccompound_pubmed` and `pccompound_pubmed_mesh` Entrez links in one dataset, we combine two separate `elink` queries with an OR operator:
82 |
83 | ```console
84 |
85 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 174076[uid] | \
86 | > elink -target pubmed -name pccompound_pubmed -label pubmed_cit | \
87 | > esearch -email name@xx.edu -db pccompound -query 174076[uid] | \
88 | > elink -target pubmed -name pccompound_pubmed_mesh -label pubmed_mesh_cit | \
89 | > esearch -query "(#pubmed_cit) OR (#pubmed_mesh_cit)" | \
90 | > efetch -format xml | \
91 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
92 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn
93 | 32231037 Babicka M Molecules 2020 25 7
94 | 31931064 Love SA Int J Biol Macromol 2020 147 569-575
95 | 31818016 Wang F Int J Mol Sci 2019 20 24
96 | 31814059 Weber AL Orig Life Evol Biosph 2019 49 4 199-211
97 | 31675504 Gomez-Herrero E Ecotoxicol Environ Saf 2020 187 109836
98 | 31520950 Pal S Ecotoxicol Environ Saf 2019 184 109634
99 | ...
100 | ```
101 | _tested on 2021.01.26, EDirect 14.4, total count was 317._
102 |
103 |
104 | ### PubChem Substance --> PubChem Compound --> PubMed Publisher
105 | **Description:** Search for a PubChem Substance Data Source Depositor, find related same PubChem Compounds, and then retrieve related PubMed references linked via publisher.
106 |
107 | In the below script, we first search the PubChem Substance (`pcsubstance`) database using `esearch` for the data source depositor _Nature Communications_. We can use the Current Source Name `[CSN]` field for this query. Note that an underscore is put in place of the space in the query. This syntax is important for searching in PubChem with the EDirect `esearch` function. After `esearch`, we pipe the results into `elink` twice, first finding related PubChem Compounds via the `pcsubstance_pccompound_same` Entrez link, and then using this new result list to find related PubMed publisher deposited citations from the `pccompound_pubmed_publisher` Entrez link. Finally, similarly to previous searches, we use a combination of `efetch` and `xtract` to retrieve selected data:
108 |
109 | ```console
110 |
111 | user@computer:~$ esearch -email name@xx.edu -db pcsubstance -query "nature_communications"[CSN] | \
112 | > elink -target pccompound -name pcsubstance_pccompound_same | \
113 | > elink -target pubmed -name pccompound_pubmed_publisher | \
114 | > efetch -format xml | \
115 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName Author/Initials \
116 | > ISOAbbreviation PubDate/Year Volume Issue MedlinePgn
117 | 26673265 Gilbert ZW Nat Chem 2016 8 1 63-8
118 | 25424885 Yan T Nat Commun 2014 5 5602
119 | 25422853 Vaidya AB Nat Commun 2014 5 5521
120 | 25382411 Dommerholt J Nat Commun 2014 5 5378
121 | 25382259 Wang B Nat Commun 2014 5 5354
122 | ...
123 | ```
124 |
125 | _tested on 2021.01.26, EDirect 14.4, total count was 101._
126 |
127 |
128 | ### PubChem Substance --> PubChem Compound <--> PubMed Publisher
129 | **Description:** Search for a PubChem Substance Data Source Depositor, find related same PubChem Compounds, and then retrieve related PubMed PMIDs linked via publisher.
130 |
131 | Building upon the previous search, if needed, it is possible to obtain individual relationships of the CIDs to PubMed IDs (CID <--> PMID). We can do this using the `-cmd neighbor` option in `elink`:
132 |
133 | ```console
134 |
135 | user@computer:~$ esearch -email name@xx.edu -db pcsubstance -query "nature_communications"[CSN] | \
136 | > elink -target pccompound -name pcsubstance_pccompound_same | \
137 | > elink -target pubmed -name pccompound_pubmed_publisher -cmd neighbor | \
138 | > xtract -pattern LinkSet -element Id
139 | 146033657 24398593
140 | 136286496
141 | 136264969 24177669
142 | 136264968 24177669
143 | 136262920 23385592
144 | 136262919 23385592
145 | 136247006 24457545
146 | 136247005 24457545
147 | 136247004 24457545
148 | 136247003 24457545
149 | 136219971
150 | 135922679 22027590
151 | 91868204 23764831
152 | ...
153 | ```
154 | _tested on 2021.01.26, EDirect 14.4, total count was 1594 (returns all CIDs, not all have linked PMIDs)._
155 |
156 | The first column contains the PubChem CIDs and the second column contains the linked PMIDs. Additional linked PMIDs are placed in subsequent columns when available.
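
For downstream work it can be easier to have one CID/PMID pair per line. The following is a minimal sketch, assuming the `xtract` output is tab-delimited (the `xtract` default) and that rows without links contain only the CID. The function name `links_to_pairs` is hypothetical.

```shell
# Hypothetical helper: turn "ID <tab> link1 <tab> link2 ..." rows into one
# "ID <tab> link" pair per line; rows with no linked IDs are skipped.
links_to_pairs() {
  awk -F'\t' 'NF > 1 { for (i = 2; i <= NF; i++) print $1 "\t" $i }'
}

# Demonstration with three rows copied from the output above.
printf '146033657\t24398593\n136286496\n136264969\t24177669\n' | links_to_pairs
```

In practice, the `xtract -pattern LinkSet -element Id` step of the recipe would be piped directly into `links_to_pairs`.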
157 |
158 |
159 | ### PubChem Compound --> PubChem BioAssay
160 | **Description:** Search for a PubChem CID in PubChem Compound, then retrieve related PubChem active BioAssay data.
161 |
162 | To retrieve BioAssay results labeled as 'Active' that are linked to a CID, we can use the `elink` function with the PubChem BioAssay (`pcassay`) database via Entrez link `pccompound_pcassay_active`. This is followed by `efetch` and `xtract`. In this particular example, we extracted the AID, CurrentSourceName, AssayName, ActiveSidCount, and TargetCount:
163 |
164 | ```console
165 |
166 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "6303"[uid] | \
167 | > elink -target pcassay -name pccompound_pcassay_active | \
168 | > efetch -format docsum | \
169 | > xtract -pattern DocumentSummary -element AID CurrentSourceName AssayName ActiveSidCount TargetCount
170 | 1255098 ChEMBL Inhibition of TLR4-mediated NF-kappaB signaling pathway in BALB/c mouse RAW264.7 cells assessed as suppression of LPS-stimulated PGE2 production at 1 to 10 ug/ml preincubated for 1 hr followed by LPS challenge measured after 6 hrs by immunoblot analysis 1 1
171 | 1255092 ChEMBL Inhibition of TLR4-mediated NF-kappaB signaling pathway in BALB/c mouse RAW264.7 cells assessed as suppression of LPS-stimulated TNF-alpha production at 1 to 10 ug/ml preincubated for 1 hr followed by LPS challenge measured after 6 hrs by immunoblot analysis 1 1
172 | 751324 ChEMBL Inhibition of NFkappaB p65 nuclear translocation in mouse RAW264.7 cells after 24 hrs by DAPI staining-based laser confocal immunofluorescent microscopic analysis 1 1
173 |
174 | ...
175 | ```
176 | _tested on 2021.01.26, EDirect 14.4, total count was 47._
177 |
178 |
179 | ### PubChem Compound <--> PubChem BioAssay
180 | **Description:** Search for a PubChem CID in PubChem Compound Database, find related compounds with same connectivity, then retrieve related AIDs for each CID.
181 |
182 | It is possible to obtain individual relationships of the CIDs to BioAssay AIDs (CID <--> AID). We can do this using the `-cmd neighbor` option in `elink`. Note that we first found related compounds with same connectivity using the Entrez link `pccompound_pccompound_sameconnectivity_pulldown`. This step was followed by the `pccompound_pcassay_active` Entrez link in the PubChem BioAssay database to retrieve AID links to the CIDs. We used the 'Active' assay links here. There are also other Entrez PubChem Compound assay links such as inactive, `pccompound_pcassay_inactive`.
183 |
184 | ```console
185 |
186 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query "6303"[uid] | \
187 | > elink -target pccompound -name pccompound_pccompound_sameconnectivity_pulldown | \
188 | > elink -target pcassay -name pccompound_pcassay_active -cmd neighbor | \
189 | > xtract -pattern LinkSet -element Id
190 | ...
191 | ...
192 | 6335098
193 | 688425
194 | 451875 1347103 1296009 2551 2546
195 | 248010 687016 652245 651719 177 175 173 171 169 167 165 163 161 159 157 155 153 149 147
196 | 6303 1346987 1259407 1255098 1255092 1207585 1207584 1207579 1207578 1207577 1207576 1167619 1159565 1159562 1159559 1159557 1065715 1065714 1065710 1065706 1065697 1065696 1065695 1065713 1065705 1065699 751324 686979 686978 651820 652245 651719 602346 602250 588511 493002 463218 463212 416870 416743 216185 86858 81069 32353 32352 31719 31718 2467
197 | ```
198 | _tested on 2021.01.26, EDirect 14.4, total count was 24 CIDs (not all have associated AIDs)._
199 |
200 |
201 | ## EDirect PubMed Entrez Links
202 |
203 | ### PubMed --> PubChem Compound
204 | **Description:** Search for a PubMed article ID (PMID), then retrieve related PubChem Compounds.
205 |
206 | In the script below, we first use `esearch` to query PubMed for the article ID 29407984 in the `[PMID]` field. This result is then piped into `elink` to retrieve linked compounds in the PubChem Compound database (`pubmed_pccompound`). In this case, there was one compound and we used `efetch` to retrieve the CID record as docsum XML, followed by `xtract` to extract the IsomericSmiles, CID, and InChIKey values.
207 |
208 | ```console
209 |
210 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "29407984"[PMID] | \
211 | > elink -target pccompound -name pubmed_pccompound | \
212 | > efetch -format docsum | \
213 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey
214 | C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)O 2764 MYSWGUAQZAJSOK-UHFFFAOYSA-N
215 |
216 | ```
217 | _tested on 2021.01.26, EDirect 14.4, total count was 1._
218 |
219 |
220 | ### PubMed --> PubChem Compound (+ mixtures)
221 | **Description:** Search for a PubMed article ID (PMID), then retrieve linked PubChem Compound mixtures/components.
222 |
223 | In this script, an additional `elink` search is added to find related PubChem Mixture/Component compounds via Entrez link `pccompound_pccompound_mixture`.
224 |
225 | ```console
226 |
227 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "29407984"[PMID] | \
228 | > elink -target pccompound -name pubmed_pccompound | \
229 | > elink -target pccompound -name pccompound_pccompound_mixture | \
230 | > efetch -format docsum | xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey
231 | ...
232 | ...
233 | C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)O.C(CO)N(CCO)CCO 154963193 NGBBVVPJSFAHHI-UHFFFAOYSA-N
234 | C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)O.C(=O)(C(=O)O)O.[Na] 154963186 HNZYVRRDOWOQIT-UHFFFAOYSA-N
235 | CNC.C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)O 154963184 NBGZCMHVBXSHSN-UHFFFAOYSA-N
236 | C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)[OH2+] 153275427 MYSWGUAQZAJSOK-UHFFFAOYSA-O
237 | C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)[O-] 152748405 MYSWGUAQZAJSOK-UHFFFAOYSA-M
238 | ...
239 | ...
240 | ```
241 | _tested on 2021.01.26, EDirect 14.4, total count was 375._
242 |
243 |
244 |
245 | ### PubMed --> PubChem Compound (MESH search)
246 | **Description:** Search PubMed with a text query, then retrieve linked PubChem Compounds.
247 |
248 | We can also perform text queries in PubMed and retrieve linked PubChem Compounds. Note that in the script below we searched for "ionic liquids" in the `[MESH]` field and imidazolium in any field. Since this query requires two pairs of quotes, we have to escape the internal quotes in order for the query to be interpreted correctly. The Entrez link `pubmed_pccompound` was used to find related PubChem compounds.
249 |
250 | ```console
251 |
252 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "\"ionic liquids\"[MESH] AND imidazolium" | \
253 | > elink -target pccompound -name pubmed_pccompound | \
254 | > efetch -format docsum | \
255 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey
256 | C1=CC2=CC(=C(C(=C2C(=O)C(=C1)O)O)O)O 135403797 WDGFFVCWBZVLCE-UHFFFAOYSA-N
257 | C1=NC2=C(N1[C@H]3[C@@H]([C@@H]([C@H](O3)CO)O)O)N=C(NC2=O)N 135398635 NYHBQMYGNKIUIF-UUOKFMHZSA-N
258 | CCCCCCCCN1C=C[N+](=C1C2=[N+](C=CN2CCCC)C)C 123995430 DRJFJBHYMOHPHX-UHFFFAOYSA-N
259 | CC(=O)OC1=[N+](C=CN1CC=C)C 123614562 XSXMFLUARQMOLS-UHFFFAOYSA-N
260 | C[N+]1=C(N(C=C1)CCCCCCCCCCCCS)OC(=O)OC2=[N+](C=CN2CCCCCCCCCCCCS)C 123431445 DRJOSAVMFCYCSU-UHFFFAOYSA-P
261 | ...
262 | ```
263 | _tested on 2021.01.27, EDirect 14.4, total count was 395._
264 |
265 | ### PubMed --> PubChem Compound (MESH search, and a PubChem filter)
266 | **Description:** Search PubMed with a text query and retrieve only linked compounds containing defined chiral atoms.
267 |
268 | We can also perform some powerful filtering with `efilter`. In the below script, the `[ACDC]` field is the defined atom chiral count in PubChem. A range of 1 through 100 was added for this `[ACDC]` filter. Since it is unlikely that any of the compounds would have near 100 chiral atoms, we can be fairly confident this should capture most, if not all, cases in our search.
269 |
270 | ```console
271 |
272 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "\"ionic liquids\"[MESH] AND imidazolium" | \
273 | > elink -target pccompound -name pubmed_pccompound | \
274 | > efilter -query "1:100"[ACDC] | \
275 | > efetch -format docsum | \
276 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey
277 | C1=NC2=C(N1[C@H]3[C@@H]([C@@H]([C@H](O3)CO)O)O)N=C(NC2=O)N 135398635 NYHBQMYGNKIUIF-UUOKFMHZSA-N
278 | C([C@H](C(=O)[C@H](CO)O)O)O 54067296 WXYXERHRDKEISL-ZXZARUISSA-N
279 | B(O)(O)OCC(=O)[C@H]([C@@H]([C@@H](CO)O)O)O 53705729 BXAZSOZNRVUIGN-UYFOZJQFSA-N
280 | C([C@H]1[C@@H]([C@H]([C@@H]([C@@H](O1)O[C@@H]2[C@@H](O[C@H]([C@@H]([C@H]2O)O)O)CO)O)O)O)O 46936190 GUBGYTABKSRVRQ-AEDSEYDFSA-N
281 | C1[C@H](OC2=CC(=CC(=C2C1=O)O)OC3C(C(C(C(O3)CO)O)O)O)C4=CC=C(C=C4)O 42607902 DLIKSSGEMUFQOK-CEFFZDIVSA-N
282 | ...
283 | ```
284 | _tested on 2021.01.27, EDirect 14.4, total count was 43._
285 |
286 |
287 | ### PubMed --> PubChem Compounds + PubChem Compounds (MeSH) + PubChem Compounds (Publisher)
288 | **Description:** Search PubMed, then find linked PubChem Compounds, PubChem Compounds via PubMed MeSH, and PubChem Compound PubMed Publisher.
289 |
290 | As seen in the previous PubChem searches, there are several Entrez links from PubMed to PubChem Compound such as `pubmed_pccompound`, `pubmed_pccompound_mesh`, and `pubmed_pccompound_publisher`. We can retrieve associated compounds from all three at the same time like this:
291 |
292 | ```console
293 |
294 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "imidazolium AND bacteria" | \
295 | > elink -target pccompound -name pubmed_pccompound -label compounds_01 | \
296 | > esearch -email name@xx.edu -db pubmed -query "imidazolium AND bacteria" | \
297 | > elink -target pccompound -name pubmed_pccompound_mesh -label compounds_02 | \
298 | > esearch -email name@xx.edu -db pubmed -query "imidazolium AND bacteria" | \
299 | > elink -target pccompound -name pubmed_pccompound_publisher -label compounds_03 | \
300 | > esearch -query "(#compounds_01) OR (#compounds_02) OR (#compounds_03)" | \
301 | > efetch -format docsum | \
302 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey
303 | C[Si](C)(O)O[Si](C)(C)O.C[Si](CCC1=CC=CC=C1)(O)O[Si](C)(CCC2=CC=CC=C2)O 155288862 HYTMPCHVCUAIOW-UHFFFAOYSA-N
304 | CC(C1CCC(C(O1)OC2C(CC(C(C2O)OC3C(C([C@@](CO3)(C)O)NC)O)N)N)N)NC 146157093 CEAZRRDELHUEMR-NWNXOGAHSA-N
305 | C1=CC=C2C(=C1)C=C(N2)CC3=CC=C(C=C3)C(F)(F)F.C1=CC=C2C(=C1)C=C(N2)CC3=CC=C(C=C3)C(F)(F)F 139191468 NNXWVRXROOQQNO-UHFFFAOYSA-N
306 | CC[C@@H]1[C@@]2([C@@H]([C@H](C(=O)[C@@H](C[C@@]([C@@H]([C@H](C(=O)[C@H](C(=O)O1)C)C)O[C@@H]3[C@@H]([C@H](C[C@H](O3)C)N(C)C)O)(C)OC)C)C)N(C(=O)O2)CCCCN4C=C(N=C4)C5=CN=CC=C5)C 138402871 LJVAJPDWBABPEJ-WMGYHEQLSA-N
307 | CCOC(=O)/C(=N\NC1=CC=CC2=C1N=CC=C2)/C3=[N+](C=CN3)C 136199795 ZHOWQGCGVQGEKD-UHFFFAOYSA-O
308 | ...
309 | ...
310 | ```
311 | _tested on 2021.01.27, EDirect 14.4, total count was 568._
312 |
313 |
314 | ### PubMed <--> PubChem Compound
315 | **Description:** Search PubMed for an affiliation, find related PubChem Compounds, then retrieve related CIDs for each PMID.
316 |
317 | If we want to retrieve the PMID <--> CID relationships (for Entrez link `pubmed_pccompound`), we can achieve this using the `-cmd neighbor` option in `elink`:
318 |
319 | ```console
320 |
321 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "(university of alabama[AFFL]) \
322 | > NOT (birmingham[AFFL] OR huntsville[AFFL])" \
323 | > -datetype PDAT -mindate 2010 -maxdate 2020 | \
324 | > elink -target pccompound -name pubmed_pccompound -cmd neighbor | \
325 | > xtract -pattern LinkSet -element Id
326 | ...
327 | ...
328 | 21800250
329 | 21783326 18679079 8914 942 702
330 | 21782896
331 | 21756136
332 | 21728552
333 | 21718269
334 | 21711000 561577 169577 166929 166928 164636
335 | 21702462
336 | 21693669
337 | 21692575
338 | ...
339 | ...
340 | ```
341 |
342 | _tested on 2021.01.27, EDirect 14.4, total count was 3639 (returns all PMIDs, not all have linked CIDs)._
343 |
344 | The first column contains the PMIDs and the subsequent columns contain the linked PubChem CIDs (from the `pubmed_pccompound` links). As an aside, the PubMed query for "university of alabama" in the affiliation field (`[AFFL]`) excludes (NOT operator) any results containing huntsville or birmingham in the affiliation. This excludes references from University of Alabama at Birmingham and University of Alabama at Huntsville (including collaborative references with the Tuscaloosa campus).
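
Because the result contains every PMID, whether or not it has linked CIDs, a quick tally of how many rows actually carry links can be useful. This is a sketch under the assumption that the `xtract` output is tab-delimited and that unlinked rows hold only the PMID. The function name `count_linked` is hypothetical.

```shell
# Hypothetical helper: count rows that have at least one linked ID
# versus the total number of non-empty rows read on stdin.
count_linked() {
  awk -F'\t' '
    NF > 1  { linked++ }
    NF >= 1 { total++ }
    END     { printf "%d of %d records have links\n", linked, total }'
}

# Demonstration with three rows copied from the output above.
printf '21800250\n21783326\t18679079\t8914\t942\t702\n21782896\n' | count_linked
# prints: 1 of 3 records have links
```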
345 |
346 |
347 | ### PubMed --> PubChem BioAssay
348 | **Description:** Search PubMed for an article, find related PubChem BioAssays, then retrieve some BioAssay data.
349 |
350 | ```console
351 |
352 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "32459468"[PMID] | \
353 | > elink -target pcassay -name pubmed_pcassay | \
354 | > efetch -format docsum | \
355 | > xtract -pattern DocumentSummary -element AID CurrentSourceName AssayName ActiveSidCount TargetCount
356 | 1347414 National Center for Advancing Translational Sciences (NCATS) qHTS to identify inhibitors of the type 1 interferon - major histocompatibility complex class I in skeletal muscle: Secondary screen by immunofluorescence 0 1
357 | 1347412 National Center for Advancing Translational Sciences (NCATS) qHTS assay to identify inhibitors of the type 1 interferon - major histocompatibility complex class I in skeletal muscle: Counter screen cell viability and HiBit confirmation 0 1
358 | 1347415 National Center for Advancing Translational Sciences (NCATS) qHTS to identify inhibitors of the type 1 interferon - major histocompatibility complex class I in skeletal muscle: tertiary screen by RT-qPCR 34 1
359 | 1347413 National Center for Advancing Translational Sciences (NCATS) qHTS to identify inhibitors of the type 1 interferon - major histocompatibility complex class I in skeletal muscle: tertiary screen by RT-qPCR, retest select compounds 3 1
360 | ...
361 | ```
362 | _tested on 2021.01.27, EDirect 14.4, total count was 7._
363 |
364 | ### PubMed <--> PubChem BioAssay
365 | **Description:** Search PubMed for an article, find the articles that cite it, then find related PubChem BioAssays for each citing article.
366 |
367 | If we want to retrieve the PMID <--> AID relationships (for Entrez link `pubmed_pcassay`), we can achieve this using the `-cmd neighbor` option in `elink`. Note that here we queried PubMed for an article, then found the articles that cite it with `elink -cited`, before piping these results into the Entrez link `pubmed_pcassay`.
368 |
369 |
370 | ```console
371 |
372 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "17876319"[PMID] | \
373 | > elink -cited | \
374 | > elink -target pcassay -name pubmed_pcassay -cmd neighbor | \
375 | > xtract -pattern LinkSet -element Id
376 | ...
377 | ...
378 | 21167154
379 | 21164511
380 | 21159777
381 | 21138309 568760 568754 568753 568763 568762 568761 568759 568758 568757 568756 568755
382 | 21131971
383 | 21129186
384 | ...
385 | ...
386 | ```
387 | _tested on 2021.01.27, EDirect 14.4, total count was 366 (returns all PMIDs, not all have linked AIDs)._
388 |
389 | In the above table, the first column contains the PMIDs and the subsequent columns contain the linked BioAssays (AIDs).
390 |
391 | ## EDirect PubChem BioAssay Entrez Links
392 |
393 | ### PubChem BioAssay --> PubMed
394 | **Description:** Search PubChem BioAssay for assays from a specific source name and then find related PubMed literature.
395 |
396 | In the script below, we first use `esearch` to query PubChem BioAssay for IUPHAR/BPS_Guide_to_PHARMACOLOGY in the Source Name field (`[SNME]`). This result is then piped into `elink` to retrieve linked records in the PubMed database (`pcassay_pubmed`). The `efilter` function was used to limit the results to publication dates from 2015 through 2020. This resulted in 332 records, and we used `efetch` to retrieve the PubMed records as XML, followed by `xtract` to extract some bibliographic information.
397 |
398 | ```console
399 |
400 | user@computer:~$ esearch -email name@xx.edu -db pcassay -query "IUPHAR/BPS_Guide_to_PHARMACOLOGY"[SNME] | \
401 | > elink -target pubmed -name pcassay_pubmed | \
402 | > efilter -mindate 2015 -maxdate 2020 -datetype PDAT | \
403 | > efetch -format xml | \
404 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName \
405 | > Author/Initials ISOAbbreviation PubDate/Year Volume Issue MedlinePgn
406 | 29722898 Fu R Br. J. Pharmacol. 2018 175 14 3034-3049
407 | 29688582 Kato M Br J Clin Pharmacol 2018 84 8 1821-1829
408 | 29683659 Pike KG J. Med. Chem. 2018 61 9 3823-3841
409 | 29674331 Kawaharada S J. Pharmacol. Exp. Ther. 2018 366 1 58-65
410 | 29672049 Gucký T J. Med. Chem. 2018 61 9 3855-3869
411 | 29620892 Nikolaou A J. Med. Chem. 2018 61 8 3697-3711
412 | 29615471 Xu X J. Pharmacol. Exp. Ther. 2018 365 3 624-635
413 | 29608575 Taylor Meadows KR PLoS ONE 2018 13 4 e0193236
414 | ...
415 | ```
416 | _tested on 2021.01.27, EDirect 14.4, total count was 332._
417 |
418 | ### PubChem BioAssay --> PubChem Compound
419 | **Description:** Search PubChem BioAssay for an assay, find related PubChem Compounds, and retrieve some property data for the compounds.
420 |
421 | In the below script, we first use `esearch` to query PubChem BioAssay for the assay ID 527855 in the `[UID]` field. This result is then piped into `elink` to retrieve linked compounds in the PubChem Compound database (`pcassay_pccompound`). In this case, there were 16 compounds and we used `efetch` to retrieve the CID records as docsum XML, followed by `xtract` to extract the IsomericSmiles, CID, HydrogenBondDonorCount, HydrogenBondAcceptorCount, MolecularWeight, and XLogP values.
422 |
423 | ```console
424 |
425 | user@computer:~$ esearch -email name@xx.edu -db pcassay -query "527855"[UID] | \
426 | > elink -target pccompound -name pcassay_pccompound | \
427 | > efetch -format docsum | \
428 | > xtract -pattern DocumentSummary -element IsomericSmiles CID HydrogenBondDonorCount HydrogenBondAcceptorCount \
429 | > MolecularWeight XLogP
430 | CN(CC1=CC=CC=C1)C(=O)C2=C(NC(=N2)C3=CC=CC=C3)C(=O)O 52949178 2 4 335.400 2.9
431 | C1=CC=C(C=C1)CCN(CC2=CC=CC=C2)C(=O)C3=C(NC(=N3)C4=CC=CC=C4)C(=O)O 52948352 2 4 425.500 4.8
432 | CN(CC(=O)O)C(=O)C1=C(NC(=N1)C2=CC=CC=C2)C(=O)O 52947957 3 6 303.270 1
433 | C1=CC=C(C=C1)CNC(=O)C2=C(NC(=N2)C3=CC=CC=C3)C(=O)O 52946755 3 4 321.300 2.7
434 | CCOC(=O)CN(CC1=CC=CC=C1)C(=O)C2=C(NC(=N2)C3=CC=CC=C3)C(=O)O 52945544 2 6 407.400 3.2
435 | C1=CC=C(C=C1)CN(CC2=CC=CC=C2)C(=O)C3=C(NC(=N3)C4=CC(=CC=C4)Cl)C(=O)O 52944295 2 4 445.900 5
436 | CCNC(=O)C1=C(NC(=N1)C2=CC=CC=C2)C(=O)O 52941818 3 4 259.260 1.6
437 | CNC(=O)C1=C(NC(=N1)C2=CC=CC=C2)C(=O)O 52941817 3 4 245.230 1.2
438 | ...
439 | ```
440 | _tested on 2021.01.27, EDirect 14.4, total count was 16._
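
Once property values are in columns, they can be screened with standard text tools. The sketch below keeps only rows at or below a chosen molecular weight. It assumes tab-delimited `xtract` output with the column order used in the recipe above (MolecularWeight in column 5). The function name `filter_by_mw` and the threshold of 400 are illustrative only.

```shell
# Hypothetical helper: keep rows whose MolecularWeight (column 5) is at or
# below the threshold passed as the first argument.
filter_by_mw() {
  awk -F'\t' -v max="$1" '($5 + 0) <= (max + 0)'
}

# Demonstration with the first two rows of the output above.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  'CN(CC1=CC=CC=C1)C(=O)C2=C(NC(=N2)C3=CC=CC=C3)C(=O)O' 52949178 2 4 335.400 2.9 \
  'C1=CC=C(C=C1)CCN(CC2=CC=CC=C2)C(=O)C3=C(NC(=N3)C4=CC=CC=C4)C(=O)O' 52948352 2 4 425.500 4.8 |
  filter_by_mw 400
```

Only the first row (CID 52949178, molecular weight 335.400) passes the filter in this demonstration.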
441 |
442 | ### PubChem BioAssay <--> PubChem Compound
443 | **Description:** Search PubChem BioAssay for an assay, find related assays based on similar publications, then find related PubChem Compounds.
444 |
445 | If we want to retrieve the AID <--> CID relationships (for Entrez link `pcassay_pccompound`), we can achieve this using the `-cmd neighbor` option in `elink`. Here we queried PubChem BioAssay for an assay, then found related assays by similar publication list using `elink` (`pcassay_pcassay_similar_publication_list`). This result was then piped into the Entrez link `pcassay_pccompound`.
446 |
447 |
448 | ```console
449 |
450 | user@computer:~$ esearch -email name@xx.edu -db pcassay -query "527855"[UID] | \
451 | > elink -target pcassay -name pcassay_pcassay_similar_publication_list | \
452 | > elink -target pccompound -name pcassay_pccompound -cmd neighbor | \
453 | > xtract -pattern LinkSet -element Id
454 | ...
455 | ...
456 | 601409 54580326
457 | 601408 54580326 53257623
458 | 601154 54580326
459 | 657046 16093559
460 | 657045 70695880 70693764 70687505 70687504 70687503 70683264 70683263 70681155 70681154 70681153 60150625
461 | 527862 52948352
462 | 527861 52948352
463 | ...
464 | ...
465 | ...
466 | ```
467 |
468 | In the above table, the first column contains the AIDs and the subsequent columns contain the linked PubChem Compounds (CIDs).
469 |
470 | _tested on 2021.01.27, EDirect 14.4, total count was 81 (returns all AIDs, not all have linked CIDs)._
471 |
472 |
--------------------------------------------------------------------------------
/02_EDirect_Data_Fields_Structure.md:
--------------------------------------------------------------------------------
1 | # Available EDirect Databases, Data Fields, and Data Structures
2 |
3 | **Notes**
4 |
5 | > 1. `user@computer:~$` represents an example terminal prompt name. Actual command/argument input is after the `$`.
6 | > 2. Replace `name@xx.edu` with your email address.
7 | > 3. `\` followed by `>` on the next line represents continued terminal input. You will need to delete the `>` symbol in order to run the scripts as a copy/paste into terminal.
8 | > 4. You should validate your own EDirect scripts and results as there may be unintentional mistakes in these recipes. A convenient method is to compare your EDirect results to the NCBI Web interface search results: [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/).
9 |
10 | We can view available Entrez databases, data fields, and links (connected records) with the EDirect `einfo` function. To retrieve a list of all databases, use the `-dbs` argument:
11 |
12 | ```console
13 |
14 | user@computer:~$ einfo -email name@xx.edu -dbs
15 | annotinfo
16 | assembly
17 | biocollections
18 | bioproject
19 | biosample
20 | biosystems
21 | blastdbinfo
22 | books
23 | cdd
24 | clinvar
25 | dbvar
26 | gap
27 | gapplus
28 | gds
29 | gene
30 | genome
31 | geoprofiles
32 | grasp
33 | gtr
34 | homologene
35 | ipg
36 | medgen
37 | mesh
38 | ncbisearch
39 | nlmcatalog
40 | nuccore
41 | nucleotide
42 | omim
43 | orgtrack
44 | pcassay
45 | pccompound
46 | pcsubstance
47 | pmc
48 | popset
49 | protein
50 | proteinclusters
51 | protfam
52 | pubmed
53 | seqannot
54 | snp
55 | sra
56 | structure
57 | taxonomy
58 |
59 | ```
60 |
61 | ## PubChem Compound EDirect Fields, Links, and Data
62 |
63 | This EDirectChemInfo repository focuses on searching the PubChem Compound, PubMed, and PubChem BioAssay databases. So let's take a closer look at these three databases, starting with the PubChem Compound (`-db pccompound`) database. The `einfo` arguments `-fields` and `-links` provide information about the available data fields and linked information, respectively:
64 |
65 | ```console
66 |
67 | user@computer:~$ einfo -email name@xx.edu -db pccompound -fields
68 | AC ActiveAidCount
69 | ACC AtomChiralCount
70 | ACDC AtomChiralDefCount
71 | ACUC AtomChiralUndefCount
72 | ALL All Fields
73 | BCC BondChiralCount
74 | BCDC BondChiralDefCount
75 | BCUC BondChiralUndefCount
76 | CDAT CreateDate
77 | CPLX Complexity
78 | CSYN CompleteSynonym
79 | CUC CovalentUnitCount
80 | DCNT DepositorCount
81 | DCSY DepositorCompleteSynonym
82 | DSYN DepositorSynonym
83 | ELMT Element
84 | EMAS ExactMass
85 | FILT Filter
86 | HAC HeavyAtomCount
87 | HBAC HydrogenBondAcceptorCount
88 | HBDC HydrogenBondDonorCount
89 | IAC IsotopeAtomCount
90 | IKEY InChIKey
91 | INCH InChI
92 | MMAS MonoisotopicMass
93 | MSHT MeSHTerm
94 | MW MolecularWeight
95 | PAID PharmActionID
96 | PHMA PharmAction
97 | RBC RotatableBondCount
98 | SID SubstanceID
99 | SRCC SourceCategory
100 | SRC SourceName
101 | STID StructureID
102 | SYNO Synonym
103 | TAC TotalAidCount
104 | TFC TotalFormalCharge
105 | TPSA TPSA
106 | UID CompoundID
107 | UPAC IUPACName
108 | XLGP XLogP
109 |
110 | user@computer:~$ einfo -email name@xx.edu -db pccompound -links
111 | pccompound_biosystems BioSystems
112 | pccompound_gene Gene
113 | pccompound_mesh MeSH Keyword
114 | pccompound_nuccore Nucleotide Sequences
115 | pccompound_omim OMIM
116 | pccompound_pcassay BioAssays
117 | pccompound_pcassay_active BioAssays, Active
118 | pccompound_pcassay_activityconcmicromolar BioAssays, activity concentration at/below 1 uM
119 | pccompound_pcassay_activityconcnanomolar BioAssays, activity concentration at/below 1 nM
120 | pccompound_pcassay_inactive BioAssays, Inactive
121 | pccompound_pcassay_probe BioAssays, Probe
122 | pccompound_pccompound Similar Compounds
123 | pccompound_pccompound_3d Similar Conformers
124 | pccompound_pccompound_mixture Mixture/Component Compounds
125 | pccompound_pccompound_parent Parent Compound
126 | pccompound_pccompound_parent_connectivity_pulldown Same Parent, Connectivity
127 | pccompound_pccompound_parent_isotopes_pulldown Same Parent, Isotopes
128 | pccompound_pccompound_parent_pulldown Same Parent
129 | pccompound_pccompound_parent_stereo_pulldown Same Parent, Stereochemistry
130 | pccompound_pccompound_parent_tautomer_pulldown Same Parent, Any Tautomer
131 | pccompound_pccompound_sameanytautomer_pulldown Same, Any Tautomer
132 | pccompound_pccompound_sameconnectivity_pulldown Same, Connectivity
133 | pccompound_pccompound_sameisotopic_pulldown Same, Isotopes
134 | pccompound_pccompound_samestereochem_pulldown Same, Stereochemistry
135 | pccompound_pcsubstance PubChem Mixture Substances
136 | pccompound_pcsubstance_same PubChem Same Substances
137 | pccompound_pmc PMC Articles
138 | pccompound_protein Protein Sequences
139 | pccompound_pubmed PubMed Citations
140 | pccompound_pubmed_mesh PubMed (MeSH Keyword)
141 | pccompound_pubmed_publisher PubMed (Publisher)
142 | pccompound_structure Protein Structures
143 | pccompound_taxonomy Taxonomy
144 |
145 | ```
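
The field listing is easier to use once it can be searched. The following is a minimal sketch, assuming the `einfo -fields` output has been saved locally; here a few rows copied from the listing above stand in for the saved text, and `find_field` is a hypothetical helper name, not part of EDirect:

```shell
# A few rows copied from the `einfo -db pccompound -fields` listing above.
# In practice this text could come from: einfo -db pccompound -fields > fields.txt
fields='EMAS ExactMass
HAC HeavyAtomCount
MMAS MonoisotopicMass
MW MolecularWeight
XLGP XLogP'

# find_field (hypothetical helper): print the abbreviation of every field
# whose full name contains the given text, ignoring case.
# Only the first word of the full name is checked, which is enough for the
# single-word pccompound names used here.
find_field() {
  printf '%s\n' "$fields" | awk -v pat="$1" 'index(tolower($2), tolower(pat)) { print $1 }'
}

find_field mass
```

The abbreviation that comes back is what goes inside the square brackets of a query, as in `512323[UID]`.
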
146 | Now that we know what kinds of data are available in the PubChem Compound database, let's look at a PubChem Compound record using the `esearch` and `efetch` functions:
147 |
148 | ```console
149 |
150 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 512323[UID]
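151 | <ENTREZ_DIRECT>
152 |   <Db>pccompound</Db>
153 |   <WebEnv>MCID...</WebEnv>
154 |   <QueryKey>1</QueryKey>
155 |   <Count>1</Count>
156 |   <Step>1</Step>
157 |   <Email>name@xx.edu</Email>
158 | </ENTREZ_DIRECT>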
159 |
160 | ```
161 |
162 | We searched PubChem Compound for the Compound Identifier (CID) 512323 using `esearch`, and the NCBI Entrez server returned a summary of the search results. The WebEnv and QueryKey values identify where the search results are stored on the NCBI server. To retrieve the data, we can pipe (`|`) the `esearch` results directly into the `efetch` function:
163 |
164 | ```console
165 |
166 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 512323[UID] | \
167 | > efetch -format docsum
168 |
169 |
170 |
171 | Build210125-0720m.1
172 |
173 | 512323
174 | 512323
175 |
176 | Chemical Vendors
177 | Governmental Organizations
178 | Subscription Services
179 | Curation Efforts
180 | Journal Publishers
181 | Research and Development
182 | Legacy Depositors
183 |
184 | 2005/08/01 00:00
185 |
186 | CHEMBL1791149
187 | 89647-10-9
188 | Uridine, 2'-deoxy-5-(2-thienyl)-
189 | 5-(2'-Thienyl)-2'-beta-deoxyuridine
190 | SCHEMBL1635430
191 | 5-thien-2-yl-2'-deoxyuridine
192 | CTK2J2646
193 | 5-(2-thienyl)-2'-deoxyuridine
194 | 5-(2'-Thienyl)-2'-deoxyuridine-
195 | BDBM50407986
196 | 1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)oxolan-2-yl]-5-thiophen-2-ylpyrimidine-2,4-dione
197 | 1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)tetrahydrofuran-2-yl]-5-(2-thienyl)pyrimidine-2,4-dione
198 |
199 | 1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)oxolan-2-yl]-5-thiophen-2-ylpyrimidine-2,4-dione
200 | C1C(C(OC1N2C=C(C(=O)NC2=O)C3=CC=CS3)CO)O
201 | C1[C@@H]([C@H](O[C@H]1N2C=C(C(=O)NC2=O)C3=CC=CS3)CO)O
202 | 3
203 | C13H14N2O5S
204 | 310.330
205 | 310330
206 | 0
207 | -0.2
208 | 3
209 | 6
210 | 483.000
211 | 483000
212 | 21
213 | 3
214 | 3
215 | 0
216 | 0
217 | 0
218 | 0
219 | 0
220 | 1
221 | 127
222 | 1
223 | 37
224 | PCDQBRGMSMVLDZ-IQJOONFLSA-N
225 | 0
226 | InChI=1S/C13H14N2O5S/c16-6-9-8(17)4-11(20-9)15-5-7(10-2-1-3-21-10)12(18)14-13(15)19/h1-3,5,8-9,11,16-17H,4,6H2,(H,14,18,19)/t8-,9+,11+/m0/s1
227 |
228 |
229 |
230 | ```
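
Before piping a search into `efetch`, it can help to check how many records the search matched. The sketch below reads values out of a saved `esearch` summary with `sed`. It assumes the summary is small XML with `Db`, `WebEnv`, `QueryKey`, and `Count` elements, one per line; the element names and the `get_tag` helper are assumptions made for illustration, and the WebEnv value is left elided exactly as in the text above:

```shell
# Sample esearch summary. The element names are assumed, and the WebEnv
# value is left elided as in the text above.
summary='<ENTREZ_DIRECT>
  <Db>pccompound</Db>
  <WebEnv>MCID...</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>1</Count>
</ENTREZ_DIRECT>'

# get_tag (hypothetical helper): print the text of the first
# <NAME>...</NAME> element that sits on a single line of the summary.
get_tag() {
  printf '%s\n' "$summary" | sed -n "s:.*<$1>\(.*\)</$1>.*:\1:p" | head -n 1
}

count=$(get_tag Count)
if [ "$count" -gt 0 ]; then
  echo "search matched $count record(s) in $(get_tag Db)"
else
  echo "no records matched; skip efetch"
fi
```

With live data, the same check could be applied to the text captured from an `esearch` command before it is passed on to `efetch`.
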
231 |
232 | `efetch` returned the CID record in document summary (docsum) XML format (for other formats, see `efetch -help`). XML is useful, but we probably want to parse the data into a table for easier viewing and analysis. EDirect includes a function called `xtract` that converts Entrez XML data into tables; see `xtract -help` for more information. In brief, you select an XML tag that encloses each record to define the extraction pattern, and then list the names of the child tags (elements) whose values you want. For example, in the above CID record data, we can set the pattern to DocumentSummary (the tag that wraps each record in this case) and the elements to a few of the child tags we are interested in, such as IsomericSmiles, CID, InChIKey, MolecularFormula, and MolecularWeight:
233 |
234 | ```console
235 |
236 | user@computer:~$ esearch -email name@xx.edu -db pccompound -query 512323[UID] | \
237 | > efetch -format docsum | \
238 | > xtract -pattern DocumentSummary -element IsomericSmiles CID InChIKey MolecularFormula MolecularWeight
239 | C1[C@@H]([C@H](O[C@H]1N2C=C(C(=O)NC2=O)C3=CC=CS3)CO)O 512323 PCDQBRGMSMVLDZ-IQJOONFLSA-N C13H14N2O5S 310.330
240 |
241 | ```
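
Because `xtract` writes tab-separated columns, standard shell tools can post-process the table. The sketch below works on the row shown above, copied into a variable so that it runs without a network connection. It adds a header row, selects two columns, and converts the table to CSV; the variable names are illustrative only:

```shell
# Row copied from the xtract output above; xtract separates columns with tabs.
row=$(printf '%s\t%s\t%s\t%s\t%s' \
  'C1[C@@H]([C@H](O[C@H]1N2C=C(C(=O)NC2=O)C3=CC=CS3)CO)O' \
  '512323' 'PCDQBRGMSMVLDZ-IQJOONFLSA-N' 'C13H14N2O5S' '310.330')

# Prepend a header row so the table is self-describing.
table=$(printf 'IsomericSmiles\tCID\tInChIKey\tMolecularFormula\tMolecularWeight\n%s\n' "$row")

# Keep only the CID and MolecularWeight columns (fields 2 and 5).
printf '%s\n' "$table" | cut -f 2,5

# Convert the whole table to CSV. This plain substitution is only safe
# because none of these values contain commas or quotes.
csv=$(printf '%s\n' "$table" | tr '\t' ',')
printf '%s\n' "$csv"
```

In a live pipeline, the same `cut` or `tr` step can be appended after the `xtract` command with another pipe.
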
242 |
243 | ## PubMed EDirect Fields, Links, and Data
244 |
245 | As with PubChem Compound, let's preview the indexed fields, links, and data structure available for the PubMed database:
246 |
247 | ```console
248 |
249 | user@computer:~$ einfo -email name@xx.edu -db pubmed -fields
250 | AFFL Affiliation
251 | ALL All Fields
252 | AUCL Author Cluster ID
253 | AUID Author - Identifier
254 | AUTH Author
255 | BOOK Book
256 | CDAT Date - Completion
257 | CNTY Place of Publication
258 | COIS Conflict of Interest Statements
259 | COLN Author - Corporate
260 | CRDT Date - Create
261 | DSO DSO
262 | ECNO EC/RN Number
263 | EDAT Date - Entrez
264 | ED Editor
265 | EID Extended PMID
266 | EPDT Electronic Publication Date
267 | FAUT Author - First
268 | FILT Filter
269 | FINV Investigator - Full
270 | FULL Author - Full
271 | GRNT Grant Number
272 | INVR Investigator
273 | ISBN ISBN
274 | ISS Issue
275 | JOUR Journal
276 | LANG Language
277 | LAUT Author - Last
278 | LID Location ID
279 | MAJR MeSH Major Topic
280 | MDAT Date - Modification
281 | MESH MeSH Terms
282 | MHDA Date - MeSH
283 | OTRM Other Term
284 | PAGE Pagination
285 | PAPX Pharmacological Action
286 | PDAT Date - Publication
287 | PID Publisher ID
288 | PPDT Print Publication Date
289 | PS Subject - Personal Name
290 | PTYP Publication Type
291 | PUBN Publisher
292 | SI Secondary Source ID
293 | SUBH MeSH Subheading
294 | SUBS Supplementary Concept
295 | TIAB Title/Abstract
296 | TITL Title
297 | TT Transliterated Title
298 | UID UID
299 | VOL Volume
300 | WORD Text Word
301 |
302 | user@computer:~$ einfo -email name@xx.edu -db pubmed -links
303 | pubmed_assembly Assembly
304 | pubmed_bioproject Project Links
305 | pubmed_biosample BioSample Links
306 | pubmed_biosystems BioSystem Links
307 | pubmed_books_refs Cited in Books
308 | pubmed_cdd Conserved Domain Links
309 | pubmed_clinvar_calculated ClinVar (calculated)
310 | pubmed_clinvar ClinVar
311 | pubmed_dbvar dbVar
312 | pubmed_gap dbGaP Links
313 | pubmed_gds GEO DataSet Links
314 | pubmed_gene_bookrecords Gene (from Bookshelf)
315 | pubmed_gene_citedinomim Gene (OMIM) Links
316 | pubmed_gene Gene Links
317 | pubmed_gene_pmc_nucleotide Gene (nucleotide/PMC)
318 | pubmed_gene_rif Gene (GeneRIF) Links
319 | pubmed_genome Genome Links
320 | pubmed_geoprofiles GEO Profile Links
321 | pubmed_homologene HomoloGene Links
322 | pubmed_medgen_bookshelf_cited MedGen (Bookshelf cited)
323 | pubmed_medgen_genereviews MedGen (GeneReviews)
324 | pubmed_medgen MedGen
325 | pubmed_medgen_omim MedGen (OMIM)
326 | pubmed_nuccore Nucleotide Links
327 | pubmed_nuccore_refseq Nucleotide (RefSeq) Links
328 | pubmed_nuccore_weighted Nucleotide (Weighted) Links
329 | pubmed_omim_bookrecords OMIM (from Bookshelf)
330 | pubmed_omim_calculated OMIM (calculated) Links
331 | pubmed_omim_cited OMIM (cited) Links
332 | pubmed_pcassay PubChem BioAssay
333 | pubmed_pccompound_mesh PubChem Compound (MeSH Keyword)
334 | pubmed_pccompound PubChem Compound
335 | pubmed_pccompound_publisher PubChem Compound (Publisher)
336 | pubmed_pcsubstance_bookrecords PubChem Substance (from Bookshelf)
337 | pubmed_pcsubstance PubChem Substance Links
338 | pubmed_pcsubstance_publisher PubChem Substance (Publisher)
339 | pubmed_pmc_bookrecords References in PMC for this Bookshelf citation
340 | pubmed_pmc_embargo
341 | pubmed_pmc_local
342 | pubmed_pmc PMC Links
343 | pubmed_pmc_refs Cited in PMC
344 | pubmed_popset PopSet Links
345 | pubmed_probe Probe Links
346 | pubmed_proteinclusters Protein Cluster Links
347 | pubmed_protein Protein Links
348 | pubmed_protein_refseq Protein (RefSeq) Links
349 | pubmed_protein_weighted Protein (Weighted) Links
350 | pubmed_protfam Protein Family Models
351 | pubmed_pubmed_alsoviewed Articles frequently viewed together
352 | pubmed_pubmed_bookrecords References for this Bookshelf citation
353 | pubmed_pubmed_refs References for PMC Articles
354 | pubmed_pubmed Similar articles
355 | pubmed_snp_cited SNP (Cited)
356 | pubmed_snp SNP Links
357 | pubmed_sra SRA Links
358 | pubmed_structure Structure Links
359 | pubmed_taxonomy_entrez Taxonomy via GenBank
360 |
361 | ```
362 |
363 | And here is an example PubMed article record in abstract form:
364 |
365 | ```console
366 |
367 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "17630804"[PMID] |\
368 | > efetch -format abstract
369 |
370 | 1. J Org Chem. 2007 Aug 17;72(17):6621-3. Epub 2007 Jul 14.
371 |
372 | Total synthesis and absolute configuration determination of (+)-bruguierol C.
373 |
374 | Solorio DM(1), Jennings MP.
375 |
376 | Author information:
377 | (1)Department of Chemistry, 500 Campus Drive, The University of Alabama,
378 | Tuscaloosa, Alabama 35487-0336, USA.
379 |
380 | The first total synthesis and absolute configuration of bruguierol C are
381 | reported. The key step involved the diastereoselective capture of an in situ
382 | generated oxocarbenium ion via an intramolecular Friedel-Crafts alkylation.
383 |
384 | DOI: 10.1021/jo071035l
385 | PMID: 17630804 [Indexed for MEDLINE]
386 |
387 | ```
388 |
389 | And here is the same record in XML format:
390 |
391 | ```console
392 |
393 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "17630804"[PMID] |\
394 | > efetch -format xml
395 |
396 |
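397 | <PubmedArticleSet>
398 | <PubmedArticle>
399 |   <MedlineCitation>
400 |     <PMID>17630804</PMID>
401 |     <DateCompleted>
402 |       <Year>2007</Year>
403 |       <Month>10</Month>
404 |       <Day>25</Day>
405 |     </DateCompleted>
406 |     <DateRevised>
407 |       <Year>2007</Year>
408 |       <Month>08</Month>
409 |       <Day>10</Day>
410 |     </DateRevised>
411 |     <Article>
412 |       <Journal>
413 |         <ISSN>0022-3263</ISSN>
414 |         <JournalIssue>
415 |           <Volume>72</Volume>
416 |           <Issue>17</Issue>
417 |           <PubDate>
418 |             <Year>2007</Year>
419 |             <Month>Aug</Month>
420 |             <Day>17</Day>
421 |           </PubDate>
422 |         </JournalIssue>
423 |         <Title>The Journal of organic chemistry</Title>
424 |         <ISOAbbreviation>J Org Chem</ISOAbbreviation>
425 |       </Journal>
426 |       <ArticleTitle>Total synthesis and absolute configuration determination of (+)-bruguierol C.</ArticleTitle>
427 |       <Pagination>
428 |         <MedlinePgn>6621-3</MedlinePgn>
429 |       </Pagination>
430 |       <Abstract>
431 |         <AbstractText>The first total synthesis and absolute configuration of bruguierol C are reported. The key step involved the diastereoselective capture of an in situ generated oxocarbenium ion via an intramolecular Friedel-Crafts alkylation.</AbstractText>
432 |       </Abstract>
433 |       <AuthorList>
434 |         <Author>
435 |           <LastName>Solorio</LastName>
436 |           <ForeName>Dionicio Martinez</ForeName>
437 |           <Initials>DM</Initials>
438 |           <AffiliationInfo>
439 |             <Affiliation>Department of Chemistry, 500 Campus Drive, The University of Alabama, Tuscaloosa, Alabama 35487-0336, USA.</Affiliation>
440 |           </AffiliationInfo>
441 |         </Author>
442 |         <Author>
443 |           <LastName>Jennings</LastName>
444 |           <ForeName>Michael P</ForeName>
445 |           <Initials>MP</Initials>
446 |         </Author>
447 |       </AuthorList>
448 |       <Language>eng</Language>
449 |       <PublicationTypeList>
450 |         <PublicationType>Journal Article</PublicationType>
451 |         <PublicationType>Research Support, Non-U.S. Gov't</PublicationType>
452 |         <PublicationType>Research Support, U.S. Gov't, Non-P.H.S.</PublicationType>
453 |       </PublicationTypeList>
454 |       <ArticleDate>
455 |         <Year>2007</Year>
456 |         <Month>07</Month>
457 |         <Day>14</Day>
458 |       </ArticleDate>
459 |     </Article>
460 |     <MedlineJournalInfo>
461 |       <Country>United States</Country>
462 |       <MedlineTA>J Org Chem</MedlineTA>
463 |       <NlmUniqueID>2985193R</NlmUniqueID>
464 |       <ISSNLinking>0022-3263</ISSNLinking>
465 |     </MedlineJournalInfo>
466 |     <ChemicalList>
467 |       <Chemical>
468 |         <RegistryNumber>0</RegistryNumber>
469 |         <NameOfSubstance>Heterocyclic Compounds, 3-Ring</NameOfSubstance>
470 |       </Chemical>
471 |       <Chemical>
472 |         <RegistryNumber>0</RegistryNumber>
473 |         <NameOfSubstance>bruguierol C</NameOfSubstance>
474 |       </Chemical>
475 |     </ChemicalList>
476 |     <CitationSubset>IM</CitationSubset>
477 |     <MeshHeadingList>
478 |       <MeshHeading>
479 |         <DescriptorName>Heterocyclic Compounds, 3-Ring</DescriptorName>
480 |         <QualifierName>chemical synthesis</QualifierName>
481 |         <QualifierName>chemistry</QualifierName>
482 |       </MeshHeading>
483 |       <MeshHeading>
484 |         <DescriptorName>Magnetic Resonance Spectroscopy</DescriptorName>
485 |       </MeshHeading>
486 |       <MeshHeading>
487 |         <DescriptorName>Molecular Structure</DescriptorName>
488 |       </MeshHeading>
489 |       <MeshHeading>
490 |         <DescriptorName>Spectrometry, Mass, Electrospray Ionization</DescriptorName>
491 |       </MeshHeading>
492 |       <MeshHeading>
493 |         <DescriptorName>Spectrophotometry, Infrared</DescriptorName>
494 |       </MeshHeading>
495 |       <MeshHeading>
496 |         <DescriptorName>Stereoisomerism</DescriptorName>
497 |       </MeshHeading>
498 |     </MeshHeadingList>
499 |   </MedlineCitation>
500 |   <PubmedData>
501 |     <History>
502 |       <PubMedPubDate>
503 |         <Year>2007</Year>
504 |         <Month>7</Month>
505 |         <Day>17</Day>
506 |         <Hour>9</Hour>
507 |         <Minute>0</Minute>
508 |       </PubMedPubDate>
509 |       <PubMedPubDate>
510 |         <Year>2007</Year>
511 |         <Month>10</Month>
512 |         <Day>27</Day>
513 |         <Hour>9</Hour>
514 |         <Minute>0</Minute>
515 |       </PubMedPubDate>
516 |       <PubMedPubDate>
517 |         <Year>2007</Year>
518 |         <Month>7</Month>
519 |         <Day>17</Day>
520 |         <Hour>9</Hour>
521 |         <Minute>0</Minute>
522 |       </PubMedPubDate>
523 |     </History>
524 |     <PublicationStatus>ppublish</PublicationStatus>
525 |     <ArticleIdList>
526 |       <ArticleId>17630804</ArticleId>
527 |       <ArticleId>10.1021/jo071035l</ArticleId>
528 |     </ArticleIdList>
529 |   </PubmedData>
530 | </PubmedArticle>
531 | </PubmedArticleSet>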
532 |
533 | ```
534 | The XML PubMed record returned above is hard to read because it contains many nested fields. We can use the `xtract -outline` argument to present a structured view of only the XML data tags:
535 |
536 | ```console
537 |
538 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "17630804"[PMID] | \
539 | > efetch -format xml | \
540 | > xtract -outline
541 | PubmedArticle
542 |   MedlineCitation
543 |     PMID
544 |     DateCompleted
545 |       Year
546 |       Month
547 |       Day
548 |     DateRevised
549 |       Year
550 |       Month
551 |       Day
552 |     Article
553 |       Journal
554 |         ISSN
555 |         JournalIssue
556 |           Volume
557 |           Issue
558 |           PubDate
559 |             Year
560 |             Month
561 |             Day
562 |         Title
563 |         ISOAbbreviation
564 |       ArticleTitle
565 |       Pagination
566 |         MedlinePgn
567 |       Abstract
568 |         AbstractText
569 |       AuthorList
570 |         Author
571 |           LastName
572 |           ForeName
573 |           Initials
574 |           AffiliationInfo
575 |             Affiliation
576 |         Author
577 |           LastName
578 |           ForeName
579 |           Initials
580 |       Language
581 |       PublicationTypeList
582 |         PublicationType
583 |         PublicationType
584 |         PublicationType
585 |       ArticleDate
586 |         Year
587 |         Month
588 |         Day
589 |     MedlineJournalInfo
590 |       Country
591 |       MedlineTA
592 |       NlmUniqueID
593 |       ISSNLinking
594 |     ChemicalList
595 |       Chemical
596 |         RegistryNumber
597 |         NameOfSubstance
598 |       Chemical
599 |         RegistryNumber
600 |         NameOfSubstance
601 |     CitationSubset
602 |     MeshHeadingList
603 |       MeshHeading
604 |         DescriptorName
605 |         QualifierName
606 |         QualifierName
607 |       MeshHeading
608 |         DescriptorName
609 |       MeshHeading
610 |         DescriptorName
611 |       MeshHeading
612 |         DescriptorName
613 |       MeshHeading
614 |         DescriptorName
615 |       MeshHeading
616 |         DescriptorName
617 |   PubmedData
618 |     History
619 |       PubMedPubDate
620 |         Year
621 |         Month
622 |         Day
623 |         Hour
624 |         Minute
625 |       PubMedPubDate
626 |         Year
627 |         Month
628 |         Day
629 |         Hour
630 |         Minute
631 |       PubMedPubDate
632 |         Year
633 |         Month
634 |         Day
635 |         Hour
636 |         Minute
637 |     PublicationStatus
638 |     ArticleIdList
639 |       ArticleId
640 |       ArticleId
641 |
642 | ```
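
An outline also shows which elements repeat within a record, and that matters because a plain `-element` selection returns every occurrence. As a rough sketch, the snippet below counts element names in a saved outline. The sample is a reduced part of the outline above, with nesting shown by indentation, and the variable names are illustrative:

```shell
# Part of the outline above, reduced to the journal and author elements.
outline='Journal
  ISSN
  Title
  ISOAbbreviation
AuthorList
  Author
    LastName
    ForeName
    Initials
  Author
    LastName
    ForeName
    Initials'

# Strip the indentation, then count how often each element name occurs.
counts=$(printf '%s\n' "$outline" | sed 's/^ *//' | sort | uniq -c | awk '{ print $2, $1 }')
printf '%s\n' "$counts"

# List only the element names that occur more than once.
printf '%s\n' "$counts" | awk '$2 > 1 { print $1 }'
```

Here Author, LastName, ForeName, and Initials each occur twice, so selecting them returns one value per author unless the selection is restricted.
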
643 |
644 | The structured output above makes it easier to see how the XML is organized and to decide which data elements to extract with `xtract`, such as the PMID, Author/LastName, Author/Initials, ISOAbbreviation, ArticleTitle, PubDate, Volume, Issue, and MedlinePgn:
645 |
646 | ```console
647 |
648 | user@computer:~$ esearch -email name@xx.edu -db pubmed -query "17630804"[PMID] | \
649 | > efetch -format xml | \
650 | > xtract -pattern PubmedArticle -element MedlineCitation/PMID -first Author/LastName Author/Initials \
651 | > ISOAbbreviation ArticleTitle PubDate/Year Volume Issue MedlinePgn
652 | 17630804 Solorio DM J. Org. Chem. Total synthesis and absolute configuration determination of (+)-bruguierol C. 2007 72 17 6621-3
653 |
654 | ```
655 | Note that in the above `xtract` command, `-first` is used in place of `-element` for the elements that follow it, so only the first occurrence of each is extracted (e.g., only the first Author).
656 |
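Once the fields are in a tab-separated row, they can be rearranged into other layouts. The sketch below turns the row shown above into a single citation line with `awk`. The row is copied into a variable so the example runs offline, and the citation layout is an illustrative choice, not a format defined by EDirect or PubMed:

```shell
# Row copied from the xtract output above (tab-separated columns):
# PMID, LastName, Initials, ISOAbbreviation, ArticleTitle, Year, Volume, Issue, MedlinePgn
row=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s' \
  '17630804' 'Solorio' 'DM' 'J. Org. Chem.' \
  'Total synthesis and absolute configuration determination of (+)-bruguierol C.' \
  '2007' '72' '17' '6621-3')

# Rearrange the columns into one citation line (first author only,
# because the row itself only holds the first author).
citation=$(printf '%s\n' "$row" | awk -F '\t' \
  '{ printf "%s %s. %s %s %s;%s(%s):%s. PMID: %s\n", $2, $3, $5, $4, $6, $7, $8, $9, $1 }')
printf '%s\n' "$citation"
```

Setting the field separator to a tab (`-F '\t'`) keeps the article title, which contains spaces, in a single column.
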
657 | ## PubChem BioAssay EDirect Fields, Links, and Data
658 |
659 | Lastly, let's take a look at the PubChem BioAssay database indexed fields, links, and data structure:
660 |
661 | ```console
662 |
663 | user@computer:~$ einfo -email name@xx.edu -db pcassay -fields
664 | AC Active Sid Count
665 | ACMD Activity Outcome Method
666 | ACMT Assay Comment
667 | ADES Assay Description
668 | ALL All Fields
669 | ANAM Assay Name
670 | APRJ Assay Project
671 | APRL Assay Protocol
672 | ASRD Assay Source ID
673 | BSID BioSystems ID
674 | CCMT Categorized Comment
675 | CCT Categorized Comment Title
676 | CELL Cell Line
677 | CSNM Current Source Name
678 | DDAT Deposit Date
679 | DTMD Detection Method
680 | FILT Filter
681 | GBAC GenBank Accession
682 | GRN Grant Number
683 | GSYM Gene Symbol
684 | HDAT Hold Until Date
685 | JNAM Journal Name
686 | MDAT Modify Date
687 | NARD Nucleic Acid Reagent ID
688 | NSAM Number of Sids With Activity Concentration micromolar
689 | NSAN Number of Sids With Activity Concentration nanomolar
690 | ORGN Organism
691 | PCC Probe Cid Count
692 | PIGI Pig GI
693 | PSC Probe Sid Count
694 | PTGI Protein Target GI
695 | PTN Protein Target Name
696 | RTGI RNA Target GI
697 | SNME Source Name
698 | SRCC Source Category
699 | SYNT Synonym Tested
700 | TCNT Target Count
701 | TSC Total Sid Count
702 | TXNM Taxonomy Name
703 | UID Assay ID
704 | UPAC UniProt Accession
705 |
706 | user@computer:~$ einfo -email name@xx.edu -db pcassay -links
707 | pcassay_books_probe MLP Chemical Probe Report
708 | pcassay_cdd_protein_target Conserved Domains (Full) via Protein Target
709 | pcassay_gene_rnai RNAi Target, Tested
710 | pcassay_gene_rnai_active RNAi Target, Active
711 | pcassay_gene_target Gene Target
712 | pcassay_nuccore Nucleotide
713 | pcassay_nuccore_rna_target Nucleotide RNA Target
714 | pcassay_omim OMIM
715 | pcassay_pcassay_activityneighbor_list Related BioAssays, by Activity Overlap (List)
716 | pcassay_pcassay_assay_project Related Assay Projects
717 | pcassay_pcassay_common_gene_list Related BioAssays, by Common Active Gene (List)
718 | pcassay_pcassay_gene_interaction_list Related BioAssays, by Gene Interaction (List)
719 | pcassay_pcassay_neighbor_list Related BioAssays, by Depositor (List)
720 | pcassay_pcassay_same_assay_project_list Related BioAssays, by Same Project (List)
721 | pcassay_pcassay_same_publication_list Related BioAssays, by Same Publication (List)
722 | pcassay_pcassay_similar_publication_list Related BioAssays, by Similar Publication (List)
723 | pcassay_pcassay_targetneighbor_list Related BioAssays, by Target Similarity (List)
724 | pcassay_pccompound Compounds
725 | pcassay_pccompound_active Compounds, Active
726 | pcassay_pccompound_activityconcmicromolar Compounds, activity concentration at/below 1 uM
727 | pcassay_pccompound_activityconcnanomolar Compounds, activity concentration at/below 1 nM
728 | pcassay_pccompound_inactive Compounds, Inactive
729 | pcassay_pccompound_probe Compounds, Probe
730 | pcassay_pcsubstance Substances
731 | pcassay_pcsubstance_active Substances, Active
732 | pcassay_pcsubstance_activityconcmicromolar Substances, activity concentration at/below 1 uM
733 | pcassay_pcsubstance_activityconcnanomolar Substances, activity concentration at/below 1 nM
734 | pcassay_pcsubstance_inactive Substances, Inactive
735 | pcassay_pcsubstance_probe Substances, Probe
736 | pcassay_pmc PMC Articles
737 | pcassay_probe Nucleic acid reagent
738 | pcassay_protein_target Protein Target
739 | pcassay_protein_target_pig Protein Target, Identical Sequence
740 | pcassay_pubmed PubMed Citations
741 | pcassay_sparcle_target Target Functional Class
742 | pcassay_structure Protein Structures
743 | pcassay_taxonomy Taxonomy
744 | ```
745 |
746 | And here is an example PubChem BioAssay record in docsum XML:
747 |
748 | ```console
749 |
750 | user@computer:~$ esearch -email name@xx.edu -db pcassay -query "1236573"[UID] | \
751 | > efetch -format docsum
752 |
753 |
754 |
755 |
756 | Build200617-0952.1
757 |
758 | 1236573
759 | Cytotoxicity against human NCI/ADR cells
760 |
761 | Title: Progress Toward the Development of Noscapine and Derivatives as Anticancer Agents. Abstract: Many nitrogen-moiety containing alkaloids derived from plant origins are bioactive and play a significant role in human health and emerging medicine. Noscapine, a phthalideisoquinoline alkaloid derived from Papaver somniferum, has been used as a cough suppressant since the mid 1950s, illustrating a good safety profile. Noscapine has since been discovered to arrest cells at mitosis, albeit with moderately weak activity. Immunofluorescence staining of microtubules after 24 h of noscapine exposure at 20 inverted question markM elucidated chromosomal abnormalities and the inability of chromosomes to complete congression to the equatorial plane for proper mitotic separation ( Proc. Natl. Acad. Sci. U. S. A. 1998 , 95 , 1601 - 1606 ). A number of noscapine analogues possessing various modifications have been described within the literature and have shown significantly improved antiprolific profiles for a large variety of cancer cell lines. Several semisynthetic antimitotic alkaloids are emerging as possible candidates as novel anticancer therapies. This perspective discusses the advancing understanding of noscapine and related analogues in the fight against malignant disease.
762 | 1506980
763 |
764 | ChEMBL
765 |
766 | ChEMBL
767 |
768 | 1
769 | Confirmatory
770 | 1
771 | No
772 | 2018/10/08 00:00
773 | 2016/12/22 00:00
774 | 1/01/01 00:00
775 | 1236573
776 | 0
777 | 0
778 | 1
779 | 1
780 |
781 |
782 |
783 |
784 |
785 | ```
786 | As with the PubChem Compound and PubMed data, we can use the `xtract` function to extract specific values, such as the AID, CurrentSourceName, AssayName, ActiveSidCount, and TargetCount:
787 |
788 | ```console
789 |
790 | user@computer:~$ esearch -email name@xx.edu -db pcassay -query "1236573"[UID] | \
791 | > efetch -format docsum | \
792 | > xtract -pattern DocumentSummary -element AID CurrentSourceName AssayName ActiveSidCount TargetCount
793 | 1236573 ChEMBL Cytotoxicity against human NCI/ADR cells 1 0
794 | ```
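
When a search returns many assays, the extracted table can be filtered on a numeric column. The following is a small sketch using the single row shown above, copied into a variable so that it runs offline; `min_active` is a hypothetical helper name:

```shell
# Row copied from the xtract output above (tab-separated columns):
# AID, CurrentSourceName, AssayName, ActiveSidCount, TargetCount
rows=$(printf '%s\t%s\t%s\t%s\t%s' \
  '1236573' 'ChEMBL' 'Cytotoxicity against human NCI/ADR cells' '1' '0')

# min_active (hypothetical helper): print the AID and AssayName of rows whose
# ActiveSidCount (column 4) is at least the given threshold.
min_active() {
  printf '%s\n' "$rows" | awk -F '\t' -v min="$1" '$4 + 0 >= min + 0 { print $1 "\t" $3 }'
}

min_active 1
```

With a threshold of 1 the sample row is kept; with a threshold of 2 nothing is printed, because the assay has only one active substance.
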
795 |
796 |
--------------------------------------------------------------------------------