├── .gitignore
├── Identify somatic mutations in cancer exome without a matched pair
├── LICENSE
├── README.md
├── blog_draft.txt
├── kegg-information.md
├── lncRNA-record.txt
├── public-mutation-database
└── translate_some_blogs


/.gitignore:
--------------------------------------------------------------------------------
 1 | # History files
 2 | .Rhistory
 3 | .Rapp.history
 4 | 
 5 | # Session Data files
 6 | .RData
 7 | 
 8 | # Example code in package build process
 9 | *-Ex.R
10 | 
11 | # RStudio files
12 | .Rproj.user/
13 | 
14 | # produced vignettes
15 | vignettes/*.html
16 | vignettes/*.pdf
17 | 
18 | # OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3
19 | .httr-oauth
20 | 


--------------------------------------------------------------------------------
/Identify somatic mutations in cancer exome without a matched pair:
--------------------------------------------------------------------------------
 1 | Identify somatic mutations in cancer exome without a matched pair
 2 | Hello,
 3 | Does anyone know of an acceptable workflow/pipeline for processing cancer exome data without a matched normal?  I understand that without the normal, we won't be able to distinguish between germline and somatic mutations with 100% certainty.  Even so, I would like to process our data.  I have already prepped my data using the steps outlined in the GATK best practices (align, sort, remove dups, index, indel realign, base recalibration).  What is the next step (software, analysis, etc)?
 4 | Thanks for your help!
 5 | FYI, while I was searching for answers I came across this post:
 6 | Discrimination Between Germline And Somatic Mutations In Tumor Without The Availability Of The Normal Paired Sample
 7 | There is some useful info here.  However, it is from almost 2 years ago.  I am wondering if there is a more updated, streamlined process?
 8 | Thanks.
 9 | I have been working on something similar (using genomes).
10 | Here is what I've learned so far:
11 | -This is unfortunately more difficult than I originally hoped.   For example, the common assumption is that variants found in COSMIC will be somatic in other samples, but in practice I've flagged many variants as somatic for that reason, only to find later that they are germline.   One metric that I've found relatively useful when comparing against COSMIC is to limit the trustworthy somatic calls to those that are identified in a minimum number of studies (there are lots that are found in only 1 study).   However, using a database of somatic calls to pick the somatic calls out of a germline set has not been very successful for me.
12 | -Filtering out all the variants listed in dbSNP 144 (the latest on hg19) is very helpful.   This release now includes data from 1000 Genomes as well as ExAC, both rich germline data sets.   In my experience you need to be careful when filtering out all variants seen in ExAC, and it's better not to filter some that are at really low frequencies.
13 | -Be careful with the dbSNP filtering.   There are many real somatic variants in there.  For example, it seems all somatic variants found in COLO-829 have been flagged as somatic in dbSNP (using the SAO field).  Unfortunately, somatic variants found outside of published cell lines are not as likely to be marked as somatic in dbSNP.   In fact I did my initial testing using COLO-829, only to learn later that although dbSNP is very precise with its somatic annotations of COLO-829 variants, it is very hit or miss (mostly miss) for somatic variants identified in real cancer samples.
14 | -Be careful with over-filtering.   I have found that the germline filtering works relatively well, but there are many cases where a known hotspot mutation (in PIK3CA or BRCA2, for example) is listed in dbSNP and not marked as somatic. 
15 | Putting everything together, I'm able to get about 80% sensitivity and 20% specificity in classifying a set of (coding) variants as germline or somatic.
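
The filtering rules described in this answer can be sketched as a small decision function. This is only an illustration under stated assumptions, not the poster's actual pipeline: the `Variant` fields, the `classify` function, and both thresholds (`MIN_COSMIC_STUDIES`, `EXAC_RARE_AF`) are hypothetical, and the annotations are assumed to have been looked up beforehand.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; they are not from the thread and need tuning.
MIN_COSMIC_STUDIES = 3   # trust COSMIC only when seen in several studies
EXAC_RARE_AF = 1e-4      # below this, ExAC presence is not used as a filter


@dataclass
class Variant:
    in_dbsnp: bool = False
    dbsnp_somatic: bool = False      # assumed to come from the dbSNP SAO field
    exac_af: Optional[float] = None  # None means "not seen in ExAC"
    cosmic_studies: int = 0          # number of COSMIC studies reporting it


def classify(v: Variant) -> str:
    """Label a variant using the database heuristics from the answer above."""
    # Rescue recurrent COSMIC entries first, so known hotspots that are also
    # listed in dbSNP are not lost to over-filtering.
    if v.cosmic_studies >= MIN_COSMIC_STUDIES:
        return "likely_somatic"
    # dbSNP itself marks some variants as somatic.
    if v.in_dbsnp and v.dbsnp_somatic:
        return "likely_somatic"
    # Very rare ExAC variants are kept rather than filtered as germline.
    if v.exac_af is not None and v.exac_af < EXAC_RARE_AF:
        return "unknown"
    # Everything else seen in a germline resource is treated as germline.
    if v.in_dbsnp or v.exac_af is not None:
        return "likely_germline"
    return "unknown"
```

Checking the COSMIC rescue before the dbSNP filter is a design choice that follows the over-filtering warning; swapping the order would drop hotspots that dbSNP lists without a somatic flag.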
16 | Since you are asking about cell lines, maybe these have already been sequenced. There are bigger studies I'm aware of:
17 | 1.	Cancer Cell Line Encyclopedia (CCLE), see http://www.ncbi.nlm.nih.gov/pubmed/22460905
18 | Browse and download the data: http://www.broadinstitute.org/ccle/home
19 | 2.	NCI-60 cell line, see e.g. here: http://www.cbioportal.org/public-portal/study.do?cancer_study_id=cellline_nci60
20 | Then I would check against COSMIC for known somatic mutations. And maybe the cell line in question even has data in COSMIC, see By Sample at http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/ 
21 | Yes, I would also say it's close to impossible to distinguish between novel somatic variants and private or rare germline variants that are not in dbSNP/1000genomes/ESP6500. Looking at both mutation allele frequency and copy number at that position might help for cases where a somatic mutation occurred after a copy number change, but it's still more guessing than identifying. 
22 | Here is what I do:
23 | 1.	Flag known germline variants by looking in dbSNP. I use a subset of dbSNP (> 1% minor allele frequency, mapping only once to reference assembly, and not flagged as "clinically associated"). You can get such a file for ANNOVAR (database name is snp137NonFlagged for the current dbSNP build), see http://www.openbioinformatics.org/annovar/annovar_download.html
24 | 2.	Flag known somatic variants by looking in COSMIC. This usually finds well-described hotspot mutations (such as activating KRAS mutations), but overall will not find most of your true somatic variants (my guess). I usually take the whole of COSMIC, irrespective of tumor type.
25 | 3.	Add other cancer sequencing studies (e.g. TCGA), as many of these are not yet in COSMIC. For TCGA, I use the MAF files available at https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/. Level 3 MAF files contain experimentally validated somatic mutations only. Level 2 MAF files also contain the unvalidated ones (and can contain germline variants).
26 | 4.	Look at the variant allele frequency. If it's 100%, i.e. all reads show the variant, it's very likely germline (unless your tumor sample is 100% tumor cells and all tumor cells have the mutation). If it's below 10%, it may well be an artifact, see e.g. http://www.ncbi.nlm.nih.gov/pubmed/23303777
27 | 5.	Check how all of the mismatches in your data (non-reference bases in the alignment) are distributed along the reads from 5' to 3'. If you have a much higher mismatch rate at the first/last bases of your reads, you might want to exclude these read positions.
28 | 6.	Filter your variant list further, as it will likely contain a considerable number of false positives. Table 1 of the VarScan paper http://www.ncbi.nlm.nih.gov/pubmed/22300766 is a good start (read pos, strand, variant read number and frequency, distance to 3', homopolymer, map quality and read length difference).
29 | 7.	Looking at already known cancer mutations is fine, but that only tells you about what is already known.
30 | 8.	Personally, I would look at the frequency of mutations. If it is germline, it is either 100% or 50% (clearly, not exactly 50%, but around there).
31 | 9.	If it is a somatic mutation and your samples are clinical samples (not cell lines), then infiltration with normal cells is inevitable and your mutations will be at 30-40%.
32 | 10.	If coverage is enough, you might confidently distinguish between the two.
33 | 11.	To better understand what I mean, I suggest this great paper
34 | 
35 | 
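
The allele-frequency rules of thumb from the answers above (about 100% or 50% for germline, 30-40% for somatic variants in clinical samples, below 10% possibly an artifact) can be written down as a rough triage function. This is a minimal sketch, not a validated caller: the function name `vaf_call`, the label strings, the exact window boundaries, and the `min_depth` cutoff are all assumptions for illustration, and tumor purity and copy number will shift the expected frequencies.

```python
def vaf_call(alt_reads: int, total_reads: int, min_depth: int = 30) -> str:
    """Rough triage of one variant by its variant allele frequency (VAF).

    All cutoffs are illustrative assumptions, not values from the thread
    beyond the quoted 100%, 50%, 30-40% and 10% rules of thumb.
    """
    # With too few reads the observed VAF is too noisy to interpret.
    if total_reads < min_depth:
        return "low_coverage"
    vaf = alt_reads / total_reads
    if vaf < 0.10:
        return "possible_artifact"      # very low VAF may well be an artifact
    if vaf >= 0.95:
        return "likely_germline_hom"    # (nearly) all reads carry the variant
    if 0.45 <= vaf <= 0.55:
        return "likely_germline_het"    # around 50%
    if 0.30 <= vaf <= 0.40:
        return "possible_somatic"       # diluted by normal-cell infiltration
    return "ambiguous"


# Example: 35 variant reads out of 100 falls in the 30-40% window.
print(vaf_call(35, 100))
```

The result is best treated as one more flag alongside the database annotations, since a somatic variant in a pure sample or in an amplified region can land in either germline window.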


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 |                     GNU GENERAL PUBLIC LICENSE
  2 |                        Version 3, 29 June 2007
  3 | 
  4 |  Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
  5 |  Everyone is permitted to copy and distribute verbatim copies
  6 |  of this license document, but changing it is not allowed.
  7 | 
  8 |                             Preamble
  9 | 
 10 |   The GNU General Public License is a free, copyleft license for
 11 | software and other kinds of works.
 12 | 
 13 |   The licenses for most software and other practical works are designed
 14 | to take away your freedom to share and change the works.  By contrast,
 15 | the GNU General Public License is intended to guarantee your freedom to
 16 | share and change all versions of a program--to make sure it remains free
 17 | software for all its users.  We, the Free Software Foundation, use the
 18 | GNU General Public License for most of our software; it applies also to
 19 | any other work released this way by its authors.  You can apply it to
 20 | your programs, too.
 21 | 
 22 |   When we speak of free software, we are referring to freedom, not
 23 | price.  Our General Public Licenses are designed to make sure that you
 24 | have the freedom to distribute copies of free software (and charge for
 25 | them if you wish), that you receive source code or can get it if you
 26 | want it, that you can change the software or use pieces of it in new
 27 | free programs, and that you know you can do these things.
 28 | 
 29 |   To protect your rights, we need to prevent others from denying you
 30 | these rights or asking you to surrender the rights.  Therefore, you have
 31 | certain responsibilities if you distribute copies of the software, or if
 32 | you modify it: responsibilities to respect the freedom of others.
 33 | 
 34 |   For example, if you distribute copies of such a program, whether
 35 | gratis or for a fee, you must pass on to the recipients the same
 36 | freedoms that you received.  You must make sure that they, too, receive
 37 | or can get the source code.  And you must show them these terms so they
 38 | know their rights.
 39 | 
 40 |   Developers that use the GNU GPL protect your rights with two steps:
 41 | (1) assert copyright on the software, and (2) offer you this License
 42 | giving you legal permission to copy, distribute and/or modify it.
 43 | 
 44 |   For the developers' and authors' protection, the GPL clearly explains
 45 | that there is no warranty for this free software.  For both users' and
 46 | authors' sake, the GPL requires that modified versions be marked as
 47 | changed, so that their problems will not be attributed erroneously to
 48 | authors of previous versions.
 49 | 
 50 |   Some devices are designed to deny users access to install or run
 51 | modified versions of the software inside them, although the manufacturer
 52 | can do so.  This is fundamentally incompatible with the aim of
 53 | protecting users' freedom to change the software.  The systematic
 54 | pattern of such abuse occurs in the area of products for individuals to
 55 | use, which is precisely where it is most unacceptable.  Therefore, we
 56 | have designed this version of the GPL to prohibit the practice for those
 57 | products.  If such problems arise substantially in other domains, we
 58 | stand ready to extend this provision to those domains in future versions
 59 | of the GPL, as needed to protect the freedom of users.
 60 | 
 61 |   Finally, every program is threatened constantly by software patents.
 62 | States should not allow patents to restrict development and use of
 63 | software on general-purpose computers, but in those that do, we wish to
 64 | avoid the special danger that patents applied to a free program could
 65 | make it effectively proprietary.  To prevent this, the GPL assures that
 66 | patents cannot be used to render the program non-free.
 67 | 
 68 |   The precise terms and conditions for copying, distribution and
 69 | modification follow.
 70 | 
 71 |                        TERMS AND CONDITIONS
 72 | 
 73 |   0. Definitions.
 74 | 
 75 |   "This License" refers to version 3 of the GNU General Public License.
 76 | 
 77 |   "Copyright" also means copyright-like laws that apply to other kinds of
 78 | works, such as semiconductor masks.
 79 | 
 80 |   "The Program" refers to any copyrightable work licensed under this
 81 | License.  Each licensee is addressed as "you".  "Licensees" and
 82 | "recipients" may be individuals or organizations.
 83 | 
 84 |   To "modify" a work means to copy from or adapt all or part of the work
 85 | in a fashion requiring copyright permission, other than the making of an
 86 | exact copy.  The resulting work is called a "modified version" of the
 87 | earlier work or a work "based on" the earlier work.
 88 | 
 89 |   A "covered work" means either the unmodified Program or a work based
 90 | on the Program.
 91 | 
 92 |   To "propagate" a work means to do anything with it that, without
 93 | permission, would make you directly or secondarily liable for
 94 | infringement under applicable copyright law, except executing it on a
 95 | computer or modifying a private copy.  Propagation includes copying,
 96 | distribution (with or without modification), making available to the
 97 | public, and in some countries other activities as well.
 98 | 
 99 |   To "convey" a work means any kind of propagation that enables other
100 | parties to make or receive copies.  Mere interaction with a user through
101 | a computer network, with no transfer of a copy, is not conveying.
102 | 
103 |   An interactive user interface displays "Appropriate Legal Notices"
104 | to the extent that it includes a convenient and prominently visible
105 | feature that (1) displays an appropriate copyright notice, and (2)
106 | tells the user that there is no warranty for the work (except to the
107 | extent that warranties are provided), that licensees may convey the
108 | work under this License, and how to view a copy of this License.  If
109 | the interface presents a list of user commands or options, such as a
110 | menu, a prominent item in the list meets this criterion.
111 | 
112 |   1. Source Code.
113 | 
114 |   The "source code" for a work means the preferred form of the work
115 | for making modifications to it.  "Object code" means any non-source
116 | form of a work.
117 | 
118 |   A "Standard Interface" means an interface that either is an official
119 | standard defined by a recognized standards body, or, in the case of
120 | interfaces specified for a particular programming language, one that
121 | is widely used among developers working in that language.
122 | 
123 |   The "System Libraries" of an executable work include anything, other
124 | than the work as a whole, that (a) is included in the normal form of
125 | packaging a Major Component, but which is not part of that Major
126 | Component, and (b) serves only to enable use of the work with that
127 | Major Component, or to implement a Standard Interface for which an
128 | implementation is available to the public in source code form.  A
129 | "Major Component", in this context, means a major essential component
130 | (kernel, window system, and so on) of the specific operating system
131 | (if any) on which the executable work runs, or a compiler used to
132 | produce the work, or an object code interpreter used to run it.
133 | 
134 |   The "Corresponding Source" for a work in object code form means all
135 | the source code needed to generate, install, and (for an executable
136 | work) run the object code and to modify the work, including scripts to
137 | control those activities.  However, it does not include the work's
138 | System Libraries, or general-purpose tools or generally available free
139 | programs which are used unmodified in performing those activities but
140 | which are not part of the work.  For example, Corresponding Source
141 | includes interface definition files associated with source files for
142 | the work, and the source code for shared libraries and dynamically
143 | linked subprograms that the work is specifically designed to require,
144 | such as by intimate data communication or control flow between those
145 | subprograms and other parts of the work.
146 | 
147 |   The Corresponding Source need not include anything that users
148 | can regenerate automatically from other parts of the Corresponding
149 | Source.
150 | 
151 |   The Corresponding Source for a work in source code form is that
152 | same work.
153 | 
154 |   2. Basic Permissions.
155 | 
156 |   All rights granted under this License are granted for the term of
157 | copyright on the Program, and are irrevocable provided the stated
158 | conditions are met.  This License explicitly affirms your unlimited
159 | permission to run the unmodified Program.  The output from running a
160 | covered work is covered by this License only if the output, given its
161 | content, constitutes a covered work.  This License acknowledges your
162 | rights of fair use or other equivalent, as provided by copyright law.
163 | 
164 |   You may make, run and propagate covered works that you do not
165 | convey, without conditions so long as your license otherwise remains
166 | in force.  You may convey covered works to others for the sole purpose
167 | of having them make modifications exclusively for you, or provide you
168 | with facilities for running those works, provided that you comply with
169 | the terms of this License in conveying all material for which you do
170 | not control copyright.  Those thus making or running the covered works
171 | for you must do so exclusively on your behalf, under your direction
172 | and control, on terms that prohibit them from making any copies of
173 | your copyrighted material outside their relationship with you.
174 | 
175 |   Conveying under any other circumstances is permitted solely under
176 | the conditions stated below.  Sublicensing is not allowed; section 10
177 | makes it unnecessary.
178 | 
179 |   3. Protecting Users' Legal Rights From Anti-Circumvention Law.
180 | 
181 |   No covered work shall be deemed part of an effective technological
182 | measure under any applicable law fulfilling obligations under article
183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or
184 | similar laws prohibiting or restricting circumvention of such
185 | measures.
186 | 
187 |   When you convey a covered work, you waive any legal power to forbid
188 | circumvention of technological measures to the extent such circumvention
189 | is effected by exercising rights under this License with respect to
190 | the covered work, and you disclaim any intention to limit operation or
191 | modification of the work as a means of enforcing, against the work's
192 | users, your or third parties' legal rights to forbid circumvention of
193 | technological measures.
194 | 
195 |   4. Conveying Verbatim Copies.
196 | 
197 |   You may convey verbatim copies of the Program's source code as you
198 | receive it, in any medium, provided that you conspicuously and
199 | appropriately publish on each copy an appropriate copyright notice;
200 | keep intact all notices stating that this License and any
201 | non-permissive terms added in accord with section 7 apply to the code;
202 | keep intact all notices of the absence of any warranty; and give all
203 | recipients a copy of this License along with the Program.
204 | 
205 |   You may charge any price or no price for each copy that you convey,
206 | and you may offer support or warranty protection for a fee.
207 | 
208 |   5. Conveying Modified Source Versions.
209 | 
210 |   You may convey a work based on the Program, or the modifications to
211 | produce it from the Program, in the form of source code under the
212 | terms of section 4, provided that you also meet all of these conditions:
213 | 
214 |     a) The work must carry prominent notices stating that you modified
215 |     it, and giving a relevant date.
216 | 
217 |     b) The work must carry prominent notices stating that it is
218 |     released under this License and any conditions added under section
219 |     7.  This requirement modifies the requirement in section 4 to
220 |     "keep intact all notices".
221 | 
222 |     c) You must license the entire work, as a whole, under this
223 |     License to anyone who comes into possession of a copy.  This
224 |     License will therefore apply, along with any applicable section 7
225 |     additional terms, to the whole of the work, and all its parts,
226 |     regardless of how they are packaged.  This License gives no
227 |     permission to license the work in any other way, but it does not
228 |     invalidate such permission if you have separately received it.
229 | 
230 |     d) If the work has interactive user interfaces, each must display
231 |     Appropriate Legal Notices; however, if the Program has interactive
232 |     interfaces that do not display Appropriate Legal Notices, your
233 |     work need not make them do so.
234 | 
235 |   A compilation of a covered work with other separate and independent
236 | works, which are not by their nature extensions of the covered work,
237 | and which are not combined with it such as to form a larger program,
238 | in or on a volume of a storage or distribution medium, is called an
239 | "aggregate" if the compilation and its resulting copyright are not
240 | used to limit the access or legal rights of the compilation's users
241 | beyond what the individual works permit.  Inclusion of a covered work
242 | in an aggregate does not cause this License to apply to the other
243 | parts of the aggregate.
244 | 
245 |   6. Conveying Non-Source Forms.
246 | 
247 |   You may convey a covered work in object code form under the terms
248 | of sections 4 and 5, provided that you also convey the
249 | machine-readable Corresponding Source under the terms of this License,
250 | in one of these ways:
251 | 
252 |     a) Convey the object code in, or embodied in, a physical product
253 |     (including a physical distribution medium), accompanied by the
254 |     Corresponding Source fixed on a durable physical medium
255 |     customarily used for software interchange.
256 | 
257 |     b) Convey the object code in, or embodied in, a physical product
258 |     (including a physical distribution medium), accompanied by a
259 |     written offer, valid for at least three years and valid for as
260 |     long as you offer spare parts or customer support for that product
261 |     model, to give anyone who possesses the object code either (1) a
262 |     copy of the Corresponding Source for all the software in the
263 |     product that is covered by this License, on a durable physical
264 |     medium customarily used for software interchange, for a price no
265 |     more than your reasonable cost of physically performing this
266 |     conveying of source, or (2) access to copy the
267 |     Corresponding Source from a network server at no charge.
268 | 
269 |     c) Convey individual copies of the object code with a copy of the
270 |     written offer to provide the Corresponding Source.  This
271 |     alternative is allowed only occasionally and noncommercially, and
272 |     only if you received the object code with such an offer, in accord
273 |     with subsection 6b.
274 | 
275 |     d) Convey the object code by offering access from a designated
276 |     place (gratis or for a charge), and offer equivalent access to the
277 |     Corresponding Source in the same way through the same place at no
278 |     further charge.  You need not require recipients to copy the
279 |     Corresponding Source along with the object code.  If the place to
280 |     copy the object code is a network server, the Corresponding Source
281 |     may be on a different server (operated by you or a third party)
282 |     that supports equivalent copying facilities, provided you maintain
283 |     clear directions next to the object code saying where to find the
284 |     Corresponding Source.  Regardless of what server hosts the
285 |     Corresponding Source, you remain obligated to ensure that it is
286 |     available for as long as needed to satisfy these requirements.
287 | 
288 |     e) Convey the object code using peer-to-peer transmission, provided
289 |     you inform other peers where the object code and Corresponding
290 |     Source of the work are being offered to the general public at no
291 |     charge under subsection 6d.
292 | 
293 |   A separable portion of the object code, whose source code is excluded
294 | from the Corresponding Source as a System Library, need not be
295 | included in conveying the object code work.
296 | 
297 |   A "User Product" is either (1) a "consumer product", which means any
298 | tangible personal property which is normally used for personal, family,
299 | or household purposes, or (2) anything designed or sold for incorporation
300 | into a dwelling.  In determining whether a product is a consumer product,
301 | doubtful cases shall be resolved in favor of coverage.  For a particular
302 | product received by a particular user, "normally used" refers to a
303 | typical or common use of that class of product, regardless of the status
304 | of the particular user or of the way in which the particular user
305 | actually uses, or expects or is expected to use, the product.  A product
306 | is a consumer product regardless of whether the product has substantial
307 | commercial, industrial or non-consumer uses, unless such uses represent
308 | the only significant mode of use of the product.
309 | 
310 |   "Installation Information" for a User Product means any methods,
311 | procedures, authorization keys, or other information required to install
312 | and execute modified versions of a covered work in that User Product from
313 | a modified version of its Corresponding Source.  The information must
314 | suffice to ensure that the continued functioning of the modified object
315 | code is in no case prevented or interfered with solely because
316 | modification has been made.
317 | 
318 |   If you convey an object code work under this section in, or with, or
319 | specifically for use in, a User Product, and the conveying occurs as
320 | part of a transaction in which the right of possession and use of the
321 | User Product is transferred to the recipient in perpetuity or for a
322 | fixed term (regardless of how the transaction is characterized), the
323 | Corresponding Source conveyed under this section must be accompanied
324 | by the Installation Information.  But this requirement does not apply
325 | if neither you nor any third party retains the ability to install
326 | modified object code on the User Product (for example, the work has
327 | been installed in ROM).
328 | 
329 |   The requirement to provide Installation Information does not include a
330 | requirement to continue to provide support service, warranty, or updates
331 | for a work that has been modified or installed by the recipient, or for
332 | the User Product in which it has been modified or installed.  Access to a
333 | network may be denied when the modification itself materially and
334 | adversely affects the operation of the network or violates the rules and
335 | protocols for communication across the network.
336 | 
337 |   Corresponding Source conveyed, and Installation Information provided,
338 | in accord with this section must be in a format that is publicly
339 | documented (and with an implementation available to the public in
340 | source code form), and must require no special password or key for
341 | unpacking, reading or copying.
342 | 
343 |   7. Additional Terms.
344 | 
345 |   "Additional permissions" are terms that supplement the terms of this
346 | License by making exceptions from one or more of its conditions.
347 | Additional permissions that are applicable to the entire Program shall
348 | be treated as though they were included in this License, to the extent
349 | that they are valid under applicable law.  If additional permissions
350 | apply only to part of the Program, that part may be used separately
351 | under those permissions, but the entire Program remains governed by
352 | this License without regard to the additional permissions.
353 | 
354 |   When you convey a copy of a covered work, you may at your option
355 | remove any additional permissions from that copy, or from any part of
356 | it.  (Additional permissions may be written to require their own
357 | removal in certain cases when you modify the work.)  You may place
358 | additional permissions on material, added by you to a covered work,
359 | for which you have or can give appropriate copyright permission.
360 | 
361 |   Notwithstanding any other provision of this License, for material you
362 | add to a covered work, you may (if authorized by the copyright holders of
363 | that material) supplement the terms of this License with terms:
364 | 
365 |     a) Disclaiming warranty or limiting liability differently from the
366 |     terms of sections 15 and 16 of this License; or
367 | 
368 |     b) Requiring preservation of specified reasonable legal notices or
369 |     author attributions in that material or in the Appropriate Legal
370 |     Notices displayed by works containing it; or
371 | 
372 |     c) Prohibiting misrepresentation of the origin of that material, or
373 |     requiring that modified versions of such material be marked in
374 |     reasonable ways as different from the original version; or
375 | 
376 |     d) Limiting the use for publicity purposes of names of licensors or
377 |     authors of the material; or
378 | 
379 |     e) Declining to grant rights under trademark law for use of some
380 |     trade names, trademarks, or service marks; or
381 | 
382 |     f) Requiring indemnification of licensors and authors of that
383 |     material by anyone who conveys the material (or modified versions of
384 |     it) with contractual assumptions of liability to the recipient, for
385 |     any liability that these contractual assumptions directly impose on
386 |     those licensors and authors.
387 | 
388 |   All other non-permissive additional terms are considered "further
389 | restrictions" within the meaning of section 10.  If the Program as you
390 | received it, or any part of it, contains a notice stating that it is
391 | governed by this License along with a term that is a further
392 | restriction, you may remove that term.  If a license document contains
393 | a further restriction but permits relicensing or conveying under this
394 | License, you may add to a covered work material governed by the terms
395 | of that license document, provided that the further restriction does
396 | not survive such relicensing or conveying.
397 | 
398 |   If you add terms to a covered work in accord with this section, you
399 | must place, in the relevant source files, a statement of the
400 | additional terms that apply to those files, or a notice indicating
401 | where to find the applicable terms.
402 | 
403 |   Additional terms, permissive or non-permissive, may be stated in the
404 | form of a separately written license, or stated as exceptions;
405 | the above requirements apply either way.
406 | 
407 |   8. Termination.
408 | 
409 |   You may not propagate or modify a covered work except as expressly
410 | provided under this License.  Any attempt otherwise to propagate or
411 | modify it is void, and will automatically terminate your rights under
412 | this License (including any patent licenses granted under the third
413 | paragraph of section 11).
414 | 
415 |   However, if you cease all violation of this License, then your
416 | license from a particular copyright holder is reinstated (a)
417 | provisionally, unless and until the copyright holder explicitly and
418 | finally terminates your license, and (b) permanently, if the copyright
419 | holder fails to notify you of the violation by some reasonable means
420 | prior to 60 days after the cessation.
421 | 
422 |   Moreover, your license from a particular copyright holder is
423 | reinstated permanently if the copyright holder notifies you of the
424 | violation by some reasonable means, this is the first time you have
425 | received notice of violation of this License (for any work) from that
426 | copyright holder, and you cure the violation prior to 30 days after
427 | your receipt of the notice.
428 | 
429 |   Termination of your rights under this section does not terminate the
430 | licenses of parties who have received copies or rights from you under
431 | this License.  If your rights have been terminated and not permanently
432 | reinstated, you do not qualify to receive new licenses for the same
433 | material under section 10.
434 | 
435 |   9. Acceptance Not Required for Having Copies.
436 | 
437 |   You are not required to accept this License in order to receive or
438 | run a copy of the Program.  Ancillary propagation of a covered work
439 | occurring solely as a consequence of using peer-to-peer transmission
440 | to receive a copy likewise does not require acceptance.  However,
441 | nothing other than this License grants you permission to propagate or
442 | modify any covered work.  These actions infringe copyright if you do
443 | not accept this License.  Therefore, by modifying or propagating a
444 | covered work, you indicate your acceptance of this License to do so.
445 | 
446 |   10. Automatic Licensing of Downstream Recipients.
447 | 
448 |   Each time you convey a covered work, the recipient automatically
449 | receives a license from the original licensors, to run, modify and
450 | propagate that work, subject to this License.  You are not responsible
451 | for enforcing compliance by third parties with this License.
452 | 
453 |   An "entity transaction" is a transaction transferring control of an
454 | organization, or substantially all assets of one, or subdividing an
455 | organization, or merging organizations.  If propagation of a covered
456 | work results from an entity transaction, each party to that
457 | transaction who receives a copy of the work also receives whatever
458 | licenses to the work the party's predecessor in interest had or could
459 | give under the previous paragraph, plus a right to possession of the
460 | Corresponding Source of the work from the predecessor in interest, if
461 | the predecessor has it or can get it with reasonable efforts.
462 | 
463 |   You may not impose any further restrictions on the exercise of the
464 | rights granted or affirmed under this License.  For example, you may
465 | not impose a license fee, royalty, or other charge for exercise of
466 | rights granted under this License, and you may not initiate litigation
467 | (including a cross-claim or counterclaim in a lawsuit) alleging that
468 | any patent claim is infringed by making, using, selling, offering for
469 | sale, or importing the Program or any portion of it.
470 | 
471 |   11. Patents.
472 | 
473 |   A "contributor" is a copyright holder who authorizes use under this
474 | License of the Program or a work on which the Program is based.  The
475 | work thus licensed is called the contributor's "contributor version".
476 | 
477 |   A contributor's "essential patent claims" are all patent claims
478 | owned or controlled by the contributor, whether already acquired or
479 | hereafter acquired, that would be infringed by some manner, permitted
480 | by this License, of making, using, or selling its contributor version,
481 | but do not include claims that would be infringed only as a
482 | consequence of further modification of the contributor version.  For
483 | purposes of this definition, "control" includes the right to grant
484 | patent sublicenses in a manner consistent with the requirements of
485 | this License.
486 | 
487 |   Each contributor grants you a non-exclusive, worldwide, royalty-free
488 | patent license under the contributor's essential patent claims, to
489 | make, use, sell, offer for sale, import and otherwise run, modify and
490 | propagate the contents of its contributor version.
491 | 
492 |   In the following three paragraphs, a "patent license" is any express
493 | agreement or commitment, however denominated, not to enforce a patent
494 | (such as an express permission to practice a patent or covenant not to
495 | sue for patent infringement).  To "grant" such a patent license to a
496 | party means to make such an agreement or commitment not to enforce a
497 | patent against the party.
498 | 
499 |   If you convey a covered work, knowingly relying on a patent license,
500 | and the Corresponding Source of the work is not available for anyone
501 | to copy, free of charge and under the terms of this License, through a
502 | publicly available network server or other readily accessible means,
503 | then you must either (1) cause the Corresponding Source to be so
504 | available, or (2) arrange to deprive yourself of the benefit of the
505 | patent license for this particular work, or (3) arrange, in a manner
506 | consistent with the requirements of this License, to extend the patent
507 | license to downstream recipients.  "Knowingly relying" means you have
508 | actual knowledge that, but for the patent license, your conveying the
509 | covered work in a country, or your recipient's use of the covered work
510 | in a country, would infringe one or more identifiable patents in that
511 | country that you have reason to believe are valid.
512 | 
513 |   If, pursuant to or in connection with a single transaction or
514 | arrangement, you convey, or propagate by procuring conveyance of, a
515 | covered work, and grant a patent license to some of the parties
516 | receiving the covered work authorizing them to use, propagate, modify
517 | or convey a specific copy of the covered work, then the patent license
518 | you grant is automatically extended to all recipients of the covered
519 | work and works based on it.
520 | 
521 |   A patent license is "discriminatory" if it does not include within
522 | the scope of its coverage, prohibits the exercise of, or is
523 | conditioned on the non-exercise of one or more of the rights that are
524 | specifically granted under this License.  You may not convey a covered
525 | work if you are a party to an arrangement with a third party that is
526 | in the business of distributing software, under which you make payment
527 | to the third party based on the extent of your activity of conveying
528 | the work, and under which the third party grants, to any of the
529 | parties who would receive the covered work from you, a discriminatory
530 | patent license (a) in connection with copies of the covered work
531 | conveyed by you (or copies made from those copies), or (b) primarily
532 | for and in connection with specific products or compilations that
533 | contain the covered work, unless you entered into that arrangement,
534 | or that patent license was granted, prior to 28 March 2007.
535 | 
536 |   Nothing in this License shall be construed as excluding or limiting
537 | any implied license or other defenses to infringement that may
538 | otherwise be available to you under applicable patent law.
539 | 
540 |   12. No Surrender of Others' Freedom.
541 | 
542 |   If conditions are imposed on you (whether by court order, agreement or
543 | otherwise) that contradict the conditions of this License, they do not
544 | excuse you from the conditions of this License.  If you cannot convey a
545 | covered work so as to satisfy simultaneously your obligations under this
546 | License and any other pertinent obligations, then as a consequence you may
547 | not convey it at all.  For example, if you agree to terms that obligate you
548 | to collect a royalty for further conveying from those to whom you convey
549 | the Program, the only way you could satisfy both those terms and this
550 | License would be to refrain entirely from conveying the Program.
551 | 
552 |   13. Use with the GNU Affero General Public License.
553 | 
554 |   Notwithstanding any other provision of this License, you have
555 | permission to link or combine any covered work with a work licensed
556 | under version 3 of the GNU Affero General Public License into a single
557 | combined work, and to convey the resulting work.  The terms of this
558 | License will continue to apply to the part which is the covered work,
559 | but the special requirements of the GNU Affero General Public License,
560 | section 13, concerning interaction through a network will apply to the
561 | combination as such.
562 | 
563 |   14. Revised Versions of this License.
564 | 
565 |   The Free Software Foundation may publish revised and/or new versions of
566 | the GNU General Public License from time to time.  Such new versions will
567 | be similar in spirit to the present version, but may differ in detail to
568 | address new problems or concerns.
569 | 
570 |   Each version is given a distinguishing version number.  If the
571 | Program specifies that a certain numbered version of the GNU General
572 | Public License "or any later version" applies to it, you have the
573 | option of following the terms and conditions either of that numbered
574 | version or of any later version published by the Free Software
575 | Foundation.  If the Program does not specify a version number of the
576 | GNU General Public License, you may choose any version ever published
577 | by the Free Software Foundation.
578 | 
579 |   If the Program specifies that a proxy can decide which future
580 | versions of the GNU General Public License can be used, that proxy's
581 | public statement of acceptance of a version permanently authorizes you
582 | to choose that version for the Program.
583 | 
584 |   Later license versions may give you additional or different
585 | permissions.  However, no additional obligations are imposed on any
586 | author or copyright holder as a result of your choosing to follow a
587 | later version.
588 | 
589 |   15. Disclaimer of Warranty.
590 | 
591 |   THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
592 | APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
596 | PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
597 | IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
599 | 
600 |   16. Limitation of Liability.
601 | 
602 |   IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
610 | SUCH DAMAGES.
611 | 
612 |   17. Interpretation of Sections 15 and 16.
613 | 
614 |   If the disclaimer of warranty and limitation of liability provided
615 | above cannot be given local legal effect according to their terms,
616 | reviewing courts shall apply local law that most closely approximates
617 | an absolute waiver of all civil liability in connection with the
618 | Program, unless a warranty or assumption of liability accompanies a
619 | copy of the Program in return for a fee.
620 | 
621 |                      END OF TERMS AND CONDITIONS
622 | 
623 |             How to Apply These Terms to Your New Programs
624 | 
625 |   If you develop a new program, and you want it to be of the greatest
626 | possible use to the public, the best way to achieve this is to make it
627 | free software which everyone can redistribute and change under these terms.
628 | 
629 |   To do so, attach the following notices to the program.  It is safest
630 | to attach them to the start of each source file to most effectively
631 | state the exclusion of warranty; and each file should have at least
632 | the "copyright" line and a pointer to where the full notice is found.
633 | 
634 |     {one line to give the program's name and a brief idea of what it does.}
635 |     Copyright (C) {year}  {name of author}
636 | 
637 |     This program is free software: you can redistribute it and/or modify
638 |     it under the terms of the GNU General Public License as published by
639 |     the Free Software Foundation, either version 3 of the License, or
640 |     (at your option) any later version.
641 | 
642 |     This program is distributed in the hope that it will be useful,
643 |     but WITHOUT ANY WARRANTY; without even the implied warranty of
644 |     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
645 |     GNU General Public License for more details.
646 | 
647 |     You should have received a copy of the GNU General Public License
648 |     along with this program.  If not, see <http://www.gnu.org/licenses/>.
649 | 
650 | Also add information on how to contact you by electronic and paper mail.
651 | 
652 |   If the program does terminal interaction, make it output a short
653 | notice like this when it starts in an interactive mode:
654 | 
655 |     {project}  Copyright (C) {year}  {fullname}
656 |     This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
657 |     This is free software, and you are welcome to redistribute it
658 |     under certain conditions; type `show c' for details.
659 | 
660 | The hypothetical commands `show w' and `show c' should show the appropriate
661 | parts of the General Public License.  Of course, your program's commands
662 | might be different; for a GUI interface, you would use an "about box".
663 | 
664 |   You should also get your employer (if you work as a programmer) or school,
665 | if any, to sign a "copyright disclaimer" for the program, if necessary.
666 | For more information on this, and how to apply and follow the GNU GPL, see
667 | <http://www.gnu.org/licenses/>.
668 | 
669 |   The GNU General Public License does not permit incorporating your program
670 | into proprietary programs.  If your program is a subroutine library, you
671 | may consider it more useful to permit linking proprietary applications with
672 | the library.  If this is what you want to do, use the GNU Lesser General
673 | Public License instead of this License.  But first, please read
674 | <http://www.gnu.org/philosophy/why-not-lgpl.html>.
675 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Bookmarks of commonly used bioinformatics websites
2 | ## General-purpose sites (NCBI, ENSEMBL, UCSC)
3 | ## 
4 | 
5 | 


--------------------------------------------------------------------------------
/blog_draft.txt:
--------------------------------------------------------------------------------
 1 | http://www.compgenome.org/TCGA-Assembler/documents/TCGA-Assembler%20User%20Manual.pdf
 2 | 
 3 | The aligned BAM files are converted to BED format, so only the sequencing depth information is needed and the actual sequences do not matter.
 4 | http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html
 5 | Convert alignment files in SAM/BAM/BED format to bedGraph format:
 6 | http://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html 
 7 | bedtools genomecov  -bg -i E001-H3K4me1.tagAlign -g mygenome.txt >E001-H3K4me1.bedGraph
 8 | bedtools genomecov  -bg -i E001-Input.tagAlign -g mygenome.txt >E001-Input.bedGraph
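The `-g mygenome.txt` argument above is a tab-separated file of chromosome names and sizes, and `bedtools genomecov` expects BED-like input to be grouped by chromosome. Below is a minimal sketch of preparing both, assuming a `samtools faidx` index of the reference is available. The helper names `make_genome_file` and `sort_bed` and the file name `genome.fa.fai` are illustrative and not part of the original notes.

```shell
# Hypothetical helper: derive a bedtools "genome" file (chrom<TAB>size)
# from a samtools faidx index, whose first two columns are name and length.
make_genome_file() {
    fai="$1"
    cut -f1,2 "$fai"
}

# Hypothetical helper: sort BED-like records by chromosome, then by start,
# so that records of the same chromosome are contiguous.
sort_bed() {
    sort -k1,1 -k2,2n "$1"
}

# Usage (file names are placeholders):
#   make_genome_file genome.fa.fai > mygenome.txt
#   sort_bed E001-H3K4me1.tagAlign > E001-H3K4me1.sorted.tagAlign
```

Sorting is cheap compared with the coverage step, so it may be worth doing even when the input is believed to be grouped already.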
 9 | 
10 | ls *gz |xargs gunzip 
11 | 
12 |  1.5G Dec 29  2011 BI.ES-I3.H3K4me1.Lib_MC_20100211_02--ChIP_MC_20100208_02_hES_I3_TESR_H3K4Me1.bed
13 |  762M Nov 17  2010 BI.ES-I3.H3K4me1.Solexa-15382.bed
14 |   22M Oct 31  2013 E001-H3K4me1.broadPeak
15 |   15M Oct 31  2013 E001-H3K4me1.gappedPeak
16 |   21M Oct  9  2013 E001-H3K4me1.narrowPeak
17 |  942M Oct  7  2013 E001-H3K4me1.tagAlign
18 |  942M Oct  7  2013 E001-Input.tagAlign
19 | 
20 | 
21 | The peaks can then be read in to compare the sequencing coverage of the ChIP and input samples.
22 | broadPeak=read.table("E001-H3K4me1.broadPeak",stringsAsFactors=F)
23 | gappedPeak=read.table("E001-H3K4me1.gappedPeak",stringsAsFactors=F)
24 | narrowPeak=read.table("E001-H3K4me1.narrowPeak",stringsAsFactors=F)
25 | inputBed=read.table("E001-Input.bedGraph",stringsAsFactors=F)
26 | chipBed=read.table("E001-H3K4me1.bedGraph",stringsAsFactors=F)
27 | library('Sushi')
28 | 
29 | apply(broadPeak[1:500,],1,function(x){
30 |     chrom            = trimws(x[1])
31 |     chromstart       = as.numeric(x[2])
32 |     chromend         = as.numeric(x[3])
33 |     png( paste0(trimws(x[4]),'_broadPeak.png') )
34 |     plotBedgraph(inputBed,chrom,chromstart,chromend,color='red')
35 |     plotBedgraph(chipBed,chrom,chromstart,chromend,color='blue',
36 |                  transparency=.50,overlay=TRUE,rescaleoverlay=TRUE)
37 |     labelgenome(chrom,chromstart,chromend,side=3,n=3,scale="Mb")
38 |     dev.off()
39 | })
40 | 
41 | 
42 | 
43 | NCBI's reference genome consortium, the Genome Reference Consortium (GRC), currently covers only 4 species.
44 | http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/mouse/data/
45 | 
46 | ## Download and install HTSeq  
47 | ## http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
48 | ## https://pypi.python.org/pypi/HTSeq
49 | cd ~/biosoft
50 | mkdir HTSeq &&  cd HTSeq
51 | wget  ~~~~~~~~~~~~~~~~~~~~~~HTSeq-0.6.1.tar.gz
52 | tar zxvf HTSeq-0.6.1.tar.gz
53 | cd HTSeq-0.6.1
54 | python setup.py install --user 
55 | ## ~/.local/bin/htseq-count  --help
56 | ## ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M1/
57 | ## http://hgdownload-test.cse.ucsc.edu/goldenPath/mm10/liftOver/
58 | ## GRCm38/mm10 (Dec, 2011)
59 | ls *bam |while read id;do ( ~/.local/bin/htseq-count  -f bam  "$id"   genecode/mm9/gencode.vM1.annotation.gtf.gz  1>"${id%%.*}.gene.counts" ) ;done 
60 | ls *bam |while read id;do ( ~/.local/bin/htseq-count  -f bam -i exon_id  "$id"   genecode/mm9/gencode.vM1.annotation.gtf.gz  1>"${id%%.*}.exon.counts" ) ;done 
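The output names in the two loops above come from the shell expansion `${id%%.*}`, which removes everything from the first dot onward. Here is a small sketch of that naming rule on its own. The function name `counts_name` and the sample file names are hypothetical and only serve to illustrate the expansion.

```shell
# Hypothetical helper: build the counts file name for a BAM file.
# ${id%%.*} removes the longest suffix matching ".*", i.e. everything
# from the first dot onward, so "sample1.sorted.bam" becomes "sample1".
counts_name() {
    id="$1"      # input BAM file name
    level="$2"   # "gene" or "exon"
    printf '%s.%s.counts\n' "${id%%.*}" "$level"
}

# Usage:
#   counts_name sample1.sorted.bam gene
#   counts_name sample1.sorted.bam exon
```

Because the cut happens at the first dot, two inputs such as `sample1.rep1.bam` and `sample1.rep2.bam` would map to the same output name, so sample names with dots may need a different rule such as `${id%.bam}`.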
61 | 
62 | 
63 | 


--------------------------------------------------------------------------------
/kegg-information.md:
--------------------------------------------------------------------------------
   1 | Six top-level categories
   2 |   Metabolism
   3 | 
   4 |  	
   5 |   Genetic Information Processing
   6 | 
   7 |  	
   8 |   Environmental Information Processing
   9 | 
  10 |  	
  11 |   Cellular Processes
  12 | 
  13 |  	
  14 |   Organismal Systems
  15 | 
  16 |  	
  17 |   Human Diseases
  18 | 
  19 | 
  20 | 42 subcategories	KO
  21 |   Metabolism
  22 | 
  23 |  	
  24 |     Overview
  25 | 
  26 |  	
  27 |     Carbohydrate metabolism
  28 | 
  29 |  	
  30 |     Energy metabolism
  31 | 
  32 |  	
  33 |     Lipid metabolism
  34 | 
  35 |  	
  36 |     Nucleotide metabolism
  37 | 
  38 |  	
  39 |     Amino acid metabolism
  40 | 
  41 |  	
  42 |     Metabolism of other amino acids
  43 | 
  44 |  	
  45 |     Glycan biosynthesis and metabolism
  46 | 
  47 |  	
  48 |     Metabolism of cofactors and vitamins
  49 | 
  50 |  	
  51 |     Metabolism of terpenoids and polyketides
  52 | 
  53 |  	
  54 |     Biosynthesis of other secondary metabolites
  55 | 
  56 |  	
  57 |     Xenobiotics biodegradation and metabolism
  58 | 
  59 |  	
  60 |     Enzyme families
  61 | 
  62 |  	
  63 |   Genetic Information Processing
  64 | 
  65 |  	
  66 |     Transcription
  67 | 
  68 |  	
  69 |     Translation
  70 | 
  71 |  	
  72 |     Folding, sorting and degradation
  73 | 
  74 |  	
  75 |     Replication and repair
  76 | 
  77 |  	
  78 |     RNA family
  79 | 
  80 |  	
  81 |   Environmental Information Processing
  82 | 
  83 |  	
  84 |     Membrane transport
  85 | 
  86 |  	
  87 |     Signal transduction
  88 | 
  89 |  	
  90 |     Signaling molecules and interaction
  91 | 
  92 |  	
  93 |   Cellular Processes
  94 | 
  95 |  	
  96 |     Transport and catabolism
  97 | 
  98 |  	
  99 |     Cell motility
 100 | 
 101 |  	
 102 |     Cell growth and death
 103 | 
 104 |  	
 105 |     Cellular community
 106 | 
 107 |  	
 108 |   Organismal Systems
 109 | 
 110 |  	
 111 |     Immune system
 112 | 
 113 |  	
 114 |     Endocrine system
 115 | 
 116 |  	
 117 |     Circulatory system
 118 | 
 119 |  	
 120 |     Digestive system
 121 | 
 122 |  	
 123 |     Excretory system
 124 | 
 125 |  	
 126 |     Nervous system
 127 | 
 128 |  	
 129 |     Sensory system
 130 | 
 131 |  	
 132 |     Development
 133 | 
 134 |  	
 135 |     Environmental adaptation
 136 | 
 137 |  	
 138 |   Human Diseases
 139 | 
 140 |  	
 141 |     Cancers
 142 | 
 143 |  	
 144 |     Immune diseases
 145 | 
 146 |  	
 147 |     Neurodegenerative diseases
 148 | 
 149 |  	
 150 |     Substance dependence
 151 | 
 152 |  	
 153 |     Cardiovascular diseases
 154 | 
 155 |  	
 156 |     Endocrine and metabolic diseases
 157 | 
 158 |  	
 159 |     Infectious diseases
 160 | 
 161 |  	
 162 |     Drug resistance
 163 | 
 164 | 
 165 |   	
 166 |   Metabolism
 167 | 
 168 |  	
 169 |     Overview
 170 | 
 171 |       01200 Carbon metabolism [PATH:hsa01200]
 172 | 
 173 |       01210 2-Oxocarboxylic acid metabolism [PATH:hsa01210]
 174 | 
 175 |       01212 Fatty acid metabolism [PATH:hsa01212]
 176 | 
 177 |       01230 Biosynthesis of amino acids [PATH:hsa01230]
 178 | 
 179 |       01220 Degradation of aromatic compounds [PATH:hsa01220]
 180 | 
 181 |  	
 182 |     Carbohydrate metabolism
 183 | 
 184 |       00010 Glycolysis / Gluconeogenesis [PATH:hsa00010]
 185 | 
 186 |       00020 Citrate cycle (TCA cycle) [PATH:hsa00020]
 187 | 
 188 |       00030 Pentose phosphate pathway [PATH:hsa00030]
 189 | 
 190 |       00040 Pentose and glucuronate interconversions [PATH:hsa00040]
 191 | 
 192 |       00051 Fructose and mannose metabolism [PATH:hsa00051]
 193 | 
 194 |       00052 Galactose metabolism [PATH:hsa00052]
 195 | 
 196 |       00053 Ascorbate and aldarate metabolism [PATH:hsa00053]
 197 | 
 198 |       00500 Starch and sucrose metabolism [PATH:hsa00500]
 199 | 
 200 |       00520 Amino sugar and nucleotide sugar metabolism [PATH:hsa00520]
 201 | 
 202 |       00620 Pyruvate metabolism [PATH:hsa00620]
 203 | 
 204 |       00630 Glyoxylate and dicarboxylate metabolism [PATH:hsa00630]
 205 | 
 206 |       00640 Propanoate metabolism [PATH:hsa00640]
 207 | 
 208 |       00650 Butanoate metabolism [PATH:hsa00650]
 209 | 
 210 |       00660 C5-Branched dibasic acid metabolism
 211 | 
 212 |       00562 Inositol phosphate metabolism [PATH:hsa00562]
 213 | 
 214 |  	
 215 |     Energy metabolism
 216 | 
 217 |       00190 Oxidative phosphorylation [PATH:hsa00190]
 218 | 
 219 |       00195 Photosynthesis
 220 |       00196 Photosynthesis - antenna proteins
 221 |       00194 Photosynthesis proteins
 222 |       00710 Carbon fixation in photosynthetic organisms
 223 |       00720 Carbon fixation pathways in prokaryotes
 224 |       00680 Methane metabolism
 225 |       00910 Nitrogen metabolism [PATH:hsa00910]
 226 | 
 227 |       00920 Sulfur metabolism [PATH:hsa00920]
 228 | 
 229 |  	
 230 |     Lipid metabolism
 231 | 
 232 |       00061 Fatty acid biosynthesis [PATH:hsa00061]
 233 | 
 234 |       00062 Fatty acid elongation [PATH:hsa00062]
 235 | 
 236 |       00071 Fatty acid degradation [PATH:hsa00071]
 237 | 
 238 |       00072 Synthesis and degradation of ketone bodies [PATH:hsa00072]
 239 | 
 240 |       00073 Cutin, suberine and wax biosynthesis
 241 | 
 242 |       00100 Steroid biosynthesis [PATH:hsa00100]
 243 | 
 244 |       00120 Primary bile acid biosynthesis [PATH:hsa00120]
 245 | 
 246 |       00121 Secondary bile acid biosynthesis
 247 |       00140 Steroid hormone biosynthesis [PATH:hsa00140]
 248 | 
 249 |       00561 Glycerolipid metabolism [PATH:hsa00561]
 250 | 
 251 |       00564 Glycerophospholipid metabolism [PATH:hsa00564]
 252 | 
 253 |       00565 Ether lipid metabolism [PATH:hsa00565]
 254 | 
 255 |       00600 Sphingolipid metabolism [PATH:hsa00600]
 256 | 
 257 |       00590 Arachidonic acid metabolism [PATH:hsa00590]
 258 | 
 259 |       00591 Linoleic acid metabolism [PATH:hsa00591]
 260 | 
 261 |       00592 alpha-Linolenic acid metabolism [PATH:hsa00592]
 262 | 
 263 |       01040 Biosynthesis of unsaturated fatty acids [PATH:hsa01040]
 264 | 
 265 |       01004 Lipid biosynthesis proteins [BR:hsa01004]
 266 | 
 267 |  	
 268 |     Nucleotide metabolism
 269 | 
 270 |       00230 Purine metabolism [PATH:hsa00230]
 271 | 
 272 |       00240 Pyrimidine metabolism [PATH:hsa00240]
 273 | 
 274 |     Amino acid metabolism
 275 | 
 276 |       00250 Alanine, aspartate and glutamate metabolism [PATH:hsa00250]
 277 | 
 278 |       00260 Glycine, serine and threonine metabolism [PATH:hsa00260]
 279 | 
 280 |       00270 Cysteine and methionine metabolism [PATH:hsa00270]
 281 | 
 282 |       00280 Valine, leucine and isoleucine degradation [PATH:hsa00280]
 283 | 
 284 |       00290 Valine, leucine and isoleucine biosynthesis [PATH:hsa00290]
 285 | 
 286 |       00300 Lysine biosynthesis [PATH:hsa00300]
 287 | 
 288 |       00310 Lysine degradation [PATH:hsa00310]
 289 | 
 290 |       00220 Arginine biosynthesis [PATH:hsa00220]
 291 | 
 292 |       00330 Arginine and proline metabolism [PATH:hsa00330]
 293 | 
 294 |       00340 Histidine metabolism [PATH:hsa00340]
 295 | 
 296 |       00350 Tyrosine metabolism [PATH:hsa00350]
 297 | 
 298 |       00360 Phenylalanine metabolism [PATH:hsa00360]
 299 | 
 300 |       00380 Tryptophan metabolism [PATH:hsa00380]
 301 | 
 302 |       00400 Phenylalanine, tyrosine and tryptophan biosynthesis [PATH:hsa00400]
 303 | 
 304 |       01007 Amino acid related enzymes [BR:hsa01007]
 305 | 
 306 |  	
 307 |     Metabolism of other amino acids
 308 | 
 309 |       00410 beta-Alanine metabolism [PATH:hsa00410]
 310 | 
 311 |       00430 Taurine and hypotaurine metabolism [PATH:hsa00430]
 312 | 
 313 |       00440 Phosphonate and phosphinate metabolism
 314 |       00450 Selenocompound metabolism [PATH:hsa00450]
 315 | 
 316 |       00460 Cyanoamino acid metabolism [PATH:hsa00460]
 317 | 
 318 |       00471 D-Glutamine and D-glutamate metabolism [PATH:hsa00471]
 319 | 
 320 |       00472 D-Arginine and D-ornithine metabolism [PATH:hsa00472]
 321 | 
 322 |       00473 D-Alanine metabolism
 323 |       00480 Glutathione metabolism [PATH:hsa00480]
 324 | 
 325 |  	
 326 |     Glycan biosynthesis and metabolism
 327 | 
 328 |       01003 Glycosyltransferases [BR:hsa01003]
 329 | 
 330 |       00510 N-Glycan biosynthesis [PATH:hsa00510]
 331 | 
 332 |       00513 Various types of N-glycan biosynthesis
 333 |       00512 Mucin type O-glycan biosynthesis [PATH:hsa00512]
 334 | 
 335 |       00514 Other types of O-glycan biosynthesis [PATH:hsa00514]
 336 | 
 337 |       00532 Glycosaminoglycan biosynthesis - chondroitin sulfate / dermatan sulfate [PATH:hsa00532]
 338 | 
 339 |       00534 Glycosaminoglycan biosynthesis - heparan sulfate / heparin [PATH:hsa00534]
 340 | 
 341 |       00533 Glycosaminoglycan biosynthesis - keratan sulfate [PATH:hsa00533]
 342 | 
 343 |       00535 Proteoglycans [BR:hsa00535]
 344 | 
 345 |       00536 Glycosaminoglycan binding proteins [BR:hsa00536]
 346 | 
 347 |       00531 Glycosaminoglycan degradation [PATH:hsa00531]
 348 | 
 349 |       00563 Glycosylphosphatidylinositol(GPI)-anchor biosynthesis [PATH:hsa00563]
 350 | 
 351 |       00601 Glycosphingolipid biosynthesis - lacto and neolacto series [PATH:hsa00601]
 352 | 
 353 |       00603 Glycosphingolipid biosynthesis - globo series [PATH:hsa00603]
 354 | 
 355 |       00604 Glycosphingolipid biosynthesis - ganglio series [PATH:hsa00604]
 356 | 
 357 |       00540 Lipopolysaccharide biosynthesis
 358 |       01005 Lipopolysaccharide biosynthesis proteins
 359 |       00550 Peptidoglycan biosynthesis
 360 |       00511 Other glycan degradation [PATH:hsa00511]
 361 | 
 362 |  	
 363 |     Metabolism of cofactors and vitamins
 364 | 
 365 |       00730 Thiamine metabolism [PATH:hsa00730]
 366 | 
 367 |       00740 Riboflavin metabolism [PATH:hsa00740]
 368 | 
 369 |       00750 Vitamin B6 metabolism [PATH:hsa00750]
 370 | 
 371 |       00760 Nicotinate and nicotinamide metabolism [PATH:hsa00760]
 372 | 
 373 |       00770 Pantothenate and CoA biosynthesis [PATH:hsa00770]
 374 | 
 375 |       00780 Biotin metabolism [PATH:hsa00780]
 376 | 
 377 |       00785 Lipoic acid metabolism [PATH:hsa00785]
 378 | 
 379 |       00790 Folate biosynthesis [PATH:hsa00790]
 380 | 
 381 |       00670 One carbon pool by folate [PATH:hsa00670]
 382 | 
 383 |       00830 Retinol metabolism [PATH:hsa00830]
 384 | 
 385 |       00860 Porphyrin and chlorophyll metabolism [PATH:hsa00860]
 386 | 
 387 |       00130 Ubiquinone and other terpenoid-quinone biosynthesis [PATH:hsa00130]
 388 | 
 389 |  	
 390 |     Metabolism of terpenoids and polyketides
 391 | 
 392 |       01006 Prenyltransferases [BR:hsa01006]
 393 | 
 394 |       00900 Terpenoid backbone biosynthesis [PATH:hsa00900]
 395 | 
 396 |       00902 Monoterpenoid biosynthesis
 397 |       00909 Sesquiterpenoid and triterpenoid biosynthesis
 398 |       00904 Diterpenoid biosynthesis
 399 |       00906 Carotenoid biosynthesis
 400 |       00905 Brassinosteroid biosynthesis
 401 |       00981 Insect hormone biosynthesis
 402 |       00908 Zeatin biosynthesis
 403 |       00903 Limonene and pinene degradation
 404 |       00281 Geraniol degradation
 405 |       01008 Polyketide biosynthesis proteins
 406 |       01052 Type I polyketide structures
 407 |       00522 Biosynthesis of 12-, 14- and 16-membered macrolides
 408 |       01051 Biosynthesis of ansamycins
 409 | 
 410 |       01056 Biosynthesis of type II polyketide backbone
 411 |       01057 Biosynthesis of type II polyketide products
 412 |       00253 Tetracycline biosynthesis
 413 |       00523 Polyketide sugar unit biosynthesis
 414 | 
 415 |       01054 Nonribosomal peptide structures
 416 |       01053 Biosynthesis of siderophore group nonribosomal peptides
 417 |       01055 Biosynthesis of vancomycin group antibiotics
 418 | 
 419 |  	
 420 |     Biosynthesis of other secondary metabolites
 421 | 
 422 |       00940 Phenylpropanoid biosynthesis
 423 |       00945 Stilbenoid, diarylheptanoid and gingerol biosynthesis
 424 |       00941 Flavonoid biosynthesis
 425 |       00944 Flavone and flavonol biosynthesis
 426 |       00942 Anthocyanin biosynthesis
 427 |       00943 Isoflavonoid biosynthesis
 428 |       00901 Indole alkaloid biosynthesis
 429 |       00403 Indole diterpene alkaloid biosynthesis
 430 |       00950 Isoquinoline alkaloid biosynthesis
 431 |       00960 Tropane, piperidine and pyridine alkaloid biosynthesis
 432 |       01058 Acridone alkaloid biosynthesis
 433 |       00232 Caffeine metabolism [PATH:hsa00232]
 434 | 
 435 |       00965 Betalain biosynthesis
 436 |       00966 Glucosinolate biosynthesis
 437 |       00402 Benzoxazinoid biosynthesis
 438 |       00311 Penicillin and cephalosporin biosynthesis
 439 | 
 440 |       00332 Carbapenem biosynthesis
 441 |       00261 Monobactam biosynthesis
 442 | 
 443 |       00331 Clavulanic acid biosynthesis
 444 |       00521 Streptomycin biosynthesis
 445 |       00524 Butirosin and neomycin biosynthesis [PATH:hsa00524]
 446 | 
 447 |       00231 Puromycin biosynthesis
 448 |       00401 Novobiocin biosynthesis
 449 |       00254 Aflatoxin biosynthesis
 450 | 
 451 |  	
 452 |     Xenobiotics biodegradation and metabolism
 453 | 
 454 |       00362 Benzoate degradation
 455 |       00627 Aminobenzoate degradation
 456 |       00364 Fluorobenzoate degradation
 457 |       00625 Chloroalkane and chloroalkene degradation
 458 |       00361 Chlorocyclohexane and chlorobenzene degradation
 459 |       00623 Toluene degradation
 460 |       00622 Xylene degradation
 461 |       00633 Nitrotoluene degradation
 462 |       00642 Ethylbenzene degradation
 463 |       00643 Styrene degradation
 464 |       00791 Atrazine degradation
 465 |       00930 Caprolactam degradation
 466 |       00351 1,1,1-Trichloro-2,2-bis(4-chlorophenyl)ethane (DDT) degradation
 467 |       00363 Bisphenol degradation
 468 |       00621 Dioxin degradation
 469 |       00626 Naphthalene degradation
 470 |       00624 Polycyclic aromatic hydrocarbon degradation
 471 |       00365 Furfural degradation
 472 |       00984 Steroid degradation
 473 |       00980 Metabolism of xenobiotics by cytochrome P450 [PATH:hsa00980]
 474 | 
 475 |       00982 Drug metabolism - cytochrome P450 [PATH:hsa00982]
 476 | 
 477 |       00983 Drug metabolism - other enzymes [PATH:hsa00983]
 478 | 
 479 |  	
 480 |     Enzyme families
 481 | 
 482 |       01000 Enzymes [BR:hsa01000]
 483 |       01001 Protein kinases [BR:hsa01001]
 484 |       01009 Protein phosphatase and associated proteins [BR:hsa01009]
 485 |       01002 Peptidases [BR:hsa01002]
 486 |       00199 Cytochrome P450 [BR:hsa00199]
 487 | 
 488 | 
 489 |  
 490 |   Genetic Information Processing
 491 | 
 492 |  	
 493 |     Transcription
 494 | 
 495 |       03020 RNA polymerase [PATH:hsa03020]
 496 | 
 497 |       03022 Basal transcription factors [PATH:hsa03022]
 498 | 
 499 |       03000 Transcription factors [BR:hsa03000]
 500 | 
 501 |       03021 Transcription machinery [BR:hsa03021]
 502 | 
 503 |       03040 Spliceosome [PATH:hsa03040]
 504 | 
 505 |       03041 Spliceosome [BR:hsa03041]
 506 | 
 507 |  	
 508 |     Translation
 509 | 
 510 |       03010 Ribosome [PATH:hsa03010]
 511 | 
 512 |       03011 Ribosome [BR:hsa03011]
 513 | 
 514 |       03016 Transfer RNA biogenesis [BR:hsa03016]
 515 | 
 516 |       00970 Aminoacyl-tRNA biosynthesis [PATH:hsa00970]
 517 | 
 518 |       03013 RNA transport [PATH:hsa03013]
 519 | 
 520 |       03015 mRNA surveillance pathway [PATH:hsa03015]
 521 | 
 522 |       03019 Messenger RNA Biogenesis [BR:hsa03019]
 523 | 
 524 |       03008 Ribosome biogenesis in eukaryotes [PATH:hsa03008]
 525 | 
 526 |       03009 Ribosome biogenesis [BR:hsa03009]
 527 | 
 528 |       03029 Mitochondrial biogenesis [BR:hsa03029]
 529 | 
 530 |       03012 Translation factors [BR:hsa03012]
 531 | 
 532 |  	
 533 |     Folding, sorting and degradation
 534 | 
 535 |       03110 Chaperones and folding catalysts [BR:hsa03110]
 536 | 
 537 |       03060 Protein export [PATH:hsa03060]
 538 | 
 539 |       04141 Protein processing in endoplasmic reticulum [PATH:hsa04141]
 540 | 
 541 |       04130 SNARE interactions in vesicular transport [PATH:hsa04130]
 542 | 
 543 |       04131 SNAREs [BR:hsa04131]
 544 | 
 545 |       04120 Ubiquitin mediated proteolysis [PATH:hsa04120]
 546 | 
 547 |       04121 Ubiquitin system [BR:hsa04121]
 548 | 
 549 |       04122 Sulfur relay system [PATH:hsa04122]
 550 | 
 551 |       03050 Proteasome [PATH:hsa03050]
 552 | 
 553 |       03051 Proteasome [BR:hsa03051]
 554 | 
 555 |       03018 RNA degradation [PATH:hsa03018]
 556 | 
 557 |  	
 558 |     Replication and repair
 559 | 
 560 |       03030 DNA replication [PATH:hsa03030]
 561 | 
 562 |       03032 DNA replication proteins [BR:hsa03032]
 563 | 
 564 |       03036 Chromosome and associated proteins [BR:hsa03036]
 565 | 
 566 |       03410 Base excision repair [PATH:hsa03410]
 567 | 
 568 |       03420 Nucleotide excision repair [PATH:hsa03420]
 569 | 
 570 |       03430 Mismatch repair [PATH:hsa03430]
 571 | 
 572 |       03440 Homologous recombination [PATH:hsa03440]
 573 | 
 574 |       03450 Non-homologous end-joining [PATH:hsa03450]
 575 | 
 576 |       03460 Fanconi anemia pathway [PATH:hsa03460]
 577 | 
 578 |       03400 DNA repair and recombination proteins [BR:hsa03400]
 579 | 
 580 |  	
 581 |     RNA family
 582 | 
 583 |       03100 Non-coding RNAs [BR:hsa03100]
 584 | 
 585 | 
 586 |  
 587 |   Environmental Information Processing
 588 | 
 589 |  	
 590 |     Membrane transport
 591 | 
 592 |       02000 Transporters [BR:hsa02000]
 593 | 
 594 |       02010 ABC transporters [PATH:hsa02010]
 595 | 
 596 |       02060 Phosphotransferase system (PTS)
 597 |       03070 Bacterial secretion system
 598 |       02044 Secretion system [BR:hsa02044]
 599 | 
 600 |  	
 601 |     Signal transduction
 602 | 
 603 |       02020 Two-component system
 604 |       02022 Two-component system
 605 |       04014 Ras signaling pathway [PATH:hsa04014]
 606 | 
 607 |       04015 Rap1 signaling pathway [PATH:hsa04015]
 608 | 
 609 |       04010 MAPK signaling pathway [PATH:hsa04010]
 610 | 
 611 |       04013 MAPK signaling pathway - fly
 612 |       04011 MAPK signaling pathway - yeast
 613 |       04012 ErbB signaling pathway [PATH:hsa04012]
 614 | 
 615 |       04310 Wnt signaling pathway [PATH:hsa04310]
 616 | 
 617 |       04330 Notch signaling pathway [PATH:hsa04330]
 618 | 
 619 |       04340 Hedgehog signaling pathway [PATH:hsa04340]
 620 | 
 621 |       04350 TGF-beta signaling pathway [PATH:hsa04350]
 622 | 
 623 |       04390 Hippo signaling pathway [PATH:hsa04390]
 624 | 
 625 |       04391 Hippo signaling pathway - fly
 626 |       04370 VEGF signaling pathway [PATH:hsa04370]
 627 | 
 628 |       04630 Jak-STAT signaling pathway [PATH:hsa04630]
 629 | 
 630 |       04064 NF-kappa B signaling pathway [PATH:hsa04064]
 631 | 
 632 |       04668 TNF signaling pathway [PATH:hsa04668]
 633 | 
 634 |       04066 HIF-1 signaling pathway [PATH:hsa04066]
 635 | 
 636 |       04068 FoxO signaling pathway [PATH:hsa04068]
 637 | 
 638 |       04020 Calcium signaling pathway [PATH:hsa04020]
 639 | 
 640 |       04070 Phosphatidylinositol signaling system [PATH:hsa04070]
 641 | 
 642 |       04072 Phospholipase D signaling pathway [PATH:hsa04072]
 643 | 
 644 |       04071 Sphingolipid signaling pathway [PATH:hsa04071]
 645 | 
 646 |       04024 cAMP signaling pathway [PATH:hsa04024]
 647 | 
 648 |       04022 cGMP - PKG signaling pathway [PATH:hsa04022]
 649 | 
 650 |       04151 PI3K-Akt signaling pathway [PATH:hsa04151]
 651 | 
 652 |       04152 AMPK signaling pathway [PATH:hsa04152]
 653 | 
 654 |       04150 mTOR signaling pathway [PATH:hsa04150]
 655 | 
 656 |       04075 Plant hormone signal transduction
 657 |  	
 658 |     Signaling molecules and interaction
 659 | 
 660 |       04030 G protein-coupled receptors [BR:hsa04030]
 661 | 
 662 |       01020 Enzyme-linked receptors [BR:hsa01020]
 663 | 
 664 |       04050 Cytokine receptors [BR:hsa04050]
 665 | 
 666 |       03310 Nuclear receptors [BR:hsa03310]
 667 | 
 668 |       04040 Ion channels [BR:hsa04040]
 669 | 
 670 |       04031 GTP-binding proteins [BR:hsa04031]
 671 | 
 672 |       04080 Neuroactive ligand-receptor interaction [PATH:hsa04080]
 673 | 
 674 |       04060 Cytokine-cytokine receptor interaction [PATH:hsa04060]
 675 | 
 676 |       04052 Cytokines [BR:hsa04052]
 677 | 
 678 |       04512 ECM-receptor interaction [PATH:hsa04512]
 679 | 
 680 |       04514 Cell adhesion molecules (CAMs) [PATH:hsa04514]
 681 | 
 682 |       04516 Cell adhesion molecules and their ligands [BR:hsa04516]
 683 | 
 684 |       04090 CD Molecules [BR:hsa04090]
 685 | 
 686 |       04091 Lectins [BR:hsa04091]
 687 | 
 688 |       02042 Bacterial toxins [BR:hsa02042]
 689 | 
 690 |  
 691 |   Cellular Processes
 692 | 
 693 |  	
 694 |     Transport and catabolism
 695 | 
 696 |       04144 Endocytosis [PATH:hsa04144]
 697 | 
 698 |       04147 Exosome [BR:hsa04147]
 699 | 
 700 |       04145 Phagosome [PATH:hsa04145]
 701 | 
 702 |       04142 Lysosome [PATH:hsa04142]
 703 | 
 704 |       04146 Peroxisome [PATH:hsa04146]
 705 | 
 706 |       04140 Regulation of autophagy [PATH:hsa04140]
 707 | 
 708 |       04139 Regulation of mitophagy - yeast
 709 | 
 710 |       02048 Prokaryotic Defense System [BR:hsa02048]
 711 | 
 712 |  	
 713 |     Cell motility
 714 | 
 715 |       02030 Bacterial chemotaxis
 716 |       02035 Bacterial motility proteins
 717 |       02040 Flagellar assembly
 718 |       04810 Regulation of actin cytoskeleton [PATH:hsa04810]
 719 | 
 720 |       04812 Cytoskeleton proteins [BR:hsa04812]
 721 | 
 722 |  	
 723 |     Cell growth and death
 724 | 
 725 |       04110 Cell cycle [PATH:hsa04110]
 726 | 
 727 |       04111 Cell cycle - yeast
 728 |       04112 Cell cycle - Caulobacter
 729 |       04113 Meiosis - yeast
 730 |       04114 Oocyte meiosis [PATH:hsa04114]
 731 | 
 732 |       04210 Apoptosis [PATH:hsa04210]
 733 | 
 734 |       04115 p53 signaling pathway [PATH:hsa04115]
 735 | 
 736 |  	
 737 |     Cellular community
 738 | 
 739 |       04510 Focal adhesion [PATH:hsa04510]
 740 | 
 741 |       04520 Adherens junction [PATH:hsa04520]
 742 | 
 743 |       04530 Tight junction [PATH:hsa04530]
 744 | 
 745 |       04540 Gap junction [PATH:hsa04540]
 746 | 
 747 |       04550 Signaling pathways regulating pluripotency of stem cells [PATH:hsa04550]
 748 | 
 749 |  
 750 |   Organismal Systems
 751 | 
 752 |  	
 753 |     Immune system
 754 | 
 755 |       04640 Hematopoietic cell lineage [PATH:hsa04640]
 756 | 
 757 |       04610 Complement and coagulation cascades [PATH:hsa04610]
 758 | 
 759 |       04611 Platelet activation [PATH:hsa04611]
 760 | 
 761 |       04620 Toll-like receptor signaling pathway [PATH:hsa04620]
 762 | 
 763 |       04621 NOD-like receptor signaling pathway [PATH:hsa04621]
 764 | 
 765 |       04622 RIG-I-like receptor signaling pathway [PATH:hsa04622]
 766 | 
 767 |       04623 Cytosolic DNA-sensing pathway [PATH:hsa04623]
 768 | 
 769 |       04650 Natural killer cell mediated cytotoxicity [PATH:hsa04650]
 770 | 
 771 |       04612 Antigen processing and presentation [PATH:hsa04612]
 772 | 
 773 |       04660 T cell receptor signaling pathway [PATH:hsa04660]
 774 | 
 775 |       04662 B cell receptor signaling pathway [PATH:hsa04662]
 776 | 
 777 |       04664 Fc epsilon RI signaling pathway [PATH:hsa04664]
 778 | 
 779 |       04666 Fc gamma R-mediated phagocytosis [PATH:hsa04666]
 780 | 
 781 |       04670 Leukocyte transendothelial migration [PATH:hsa04670]
 782 | 
 783 |       04672 Intestinal immune network for IgA production [PATH:hsa04672]
 784 | 
 785 |       04062 Chemokine signaling pathway [PATH:hsa04062]
 786 | 
 787 |  	
 788 |     Endocrine system
 789 | 
 790 |       04911 Insulin secretion [PATH:hsa04911]
 791 | 
 792 |       04910 Insulin signaling pathway [PATH:hsa04910]
 793 | 
 794 |       04922 Glucagon signaling pathway [PATH:hsa04922]
 795 | 
 796 |       04923 Regulation of lipolysis in adipocyte [PATH:hsa04923]
 797 | 
 798 |       04920 Adipocytokine signaling pathway [PATH:hsa04920]
 799 | 
 800 |       03320 PPAR signaling pathway [PATH:hsa03320]
 801 | 
 802 |       04912 GnRH signaling pathway [PATH:hsa04912]
 803 | 
 804 |       04913 Ovarian Steroidogenesis [PATH:hsa04913]
 805 | 
 806 |       04915 Estrogen signaling pathway [PATH:hsa04915]
 807 | 
 808 |       04914 Progesterone-mediated oocyte maturation [PATH:hsa04914]
 809 | 
 810 |       04917 Prolactin signaling pathway [PATH:hsa04917]
 811 | 
 812 |       04921 Oxytocin signaling pathway [PATH:hsa04921]
 813 | 
 814 |       04918 Thyroid hormone synthesis [PATH:hsa04918]
 815 | 
 816 |       04919 Thyroid hormone signaling pathway [PATH:hsa04919]
 817 | 
 818 |       04916 Melanogenesis [PATH:hsa04916]
 819 | 
 820 |       04924 Renin secretion [PATH:hsa04924]
 821 | 
 822 |       04614 Renin-angiotensin system [PATH:hsa04614]
 823 | 
 824 |       04925 Aldosterone synthesis and secretion [PATH:hsa04925]
 825 | 
 826 |  	
 827 |     Circulatory system
 828 | 
 829 |       04260 Cardiac muscle contraction [PATH:hsa04260]
 830 | 
 831 |       04261 Adrenergic signaling in cardiomyocytes [PATH:hsa04261]
 832 | 
 833 |       04270 Vascular smooth muscle contraction [PATH:hsa04270]
 834 | 
 835 |  	
 836 |     Digestive system
 837 | 
 838 |       04970 Salivary secretion [PATH:hsa04970]
 839 | 
 840 |       04971 Gastric acid secretion [PATH:hsa04971]
 841 | 
 842 |       04972 Pancreatic secretion [PATH:hsa04972]
 843 | 
 844 |       04976 Bile secretion [PATH:hsa04976]
 845 | 
 846 |       04973 Carbohydrate digestion and absorption [PATH:hsa04973]
 847 | 
 848 |       04974 Protein digestion and absorption [PATH:hsa04974]
 849 | 
 850 |       04975 Fat digestion and absorption [PATH:hsa04975]
 851 | 
 852 |       04977 Vitamin digestion and absorption [PATH:hsa04977]
 853 | 
 854 |       04978 Mineral absorption [PATH:hsa04978]
 855 | 
 856 |  	
 857 |     Excretory system
 858 | 
 859 |       04962 Vasopressin-regulated water reabsorption [PATH:hsa04962]
 860 | 
 861 |       04960 Aldosterone-regulated sodium reabsorption [PATH:hsa04960]
 862 | 
 863 |       04961 Endocrine and other factor-regulated calcium reabsorption [PATH:hsa04961]
 864 | 
 865 |       04964 Proximal tubule bicarbonate reclamation [PATH:hsa04964]
 866 | 
 867 |       04966 Collecting duct acid secretion [PATH:hsa04966]
 868 | 
 869 |  	
 870 |     Nervous system
 871 | 
 872 |       04724 Glutamatergic synapse [PATH:hsa04724]
 873 | 
 874 |       04727 GABAergic synapse [PATH:hsa04727]
 875 | 
 876 |       04725 Cholinergic synapse [PATH:hsa04725]
 877 | 
 878 |       04728 Dopaminergic synapse [PATH:hsa04728]
 879 | 
 880 |       04726 Serotonergic synapse [PATH:hsa04726]
 881 | 
 882 |       04720 Long-term potentiation [PATH:hsa04720]
 883 | 
 884 |       04730 Long-term depression [PATH:hsa04730]
 885 | 
 886 |       04723 Retrograde endocannabinoid signaling [PATH:hsa04723]
 887 | 
 888 |       04721 Synaptic vesicle cycle [PATH:hsa04721]
 889 | 
 890 |       04722 Neurotrophin signaling pathway [PATH:hsa04722]
 891 | 
 892 |  	
 893 |     Sensory system
 894 | 
 895 |       04744 Phototransduction [PATH:hsa04744]
 896 | 
 897 |       04745 Phototransduction - fly
 898 |       04740 Olfactory transduction [PATH:hsa04740]
 899 | 
 900 |       04742 Taste transduction [PATH:hsa04742]
 901 | 
 902 |       04750 Inflammatory mediator regulation of TRP channels [PATH:hsa04750]
 903 | 
 904 |  	
 905 |     Development
 906 | 
 907 |       04320 Dorso-ventral axis formation [PATH:hsa04320]
 908 | 
 909 |       04360 Axon guidance [PATH:hsa04360]
 910 | 
 911 |       04380 Osteoclast differentiation [PATH:hsa04380]
 912 | 
 913 |  	
 914 |     Environmental adaptation
 915 | 
 916 |       04710 Circadian rhythm [PATH:hsa04710]
 917 | 
 918 |       04713 Circadian entrainment [PATH:hsa04713]
 919 | 
 920 |       04711 Circadian rhythm - fly
 921 |       04712 Circadian rhythm - plant
 922 |       04626 Plant-pathogen interaction
 923 |  
 924 |   Human Diseases
 925 | 
 926 |  	
 927 |     Cancers
 928 | 
 929 |       05200 Pathways in cancer [PATH:hsa05200]
 930 | 
 931 |       05230 Central carbon metabolism in cancer [PATH:hsa05230]
 932 | 
 933 |       05231 Choline metabolism in cancer [PATH:hsa05231]
 934 | 
 935 |       05202 Transcriptional misregulation in cancers [PATH:hsa05202]
 936 | 
 937 |       05206 MicroRNAs in cancer [PATH:hsa05206]
 938 | 
 939 |       05205 Proteoglycans in cancer [PATH:hsa05205]
 940 | 
 941 |       05204 Chemical carcinogenesis [PATH:hsa05204]
 942 | 
 943 |       05203 Viral carcinogenesis [PATH:hsa05203]
 944 | 
 945 |       05210 Colorectal cancer [PATH:hsa05210]
 946 | 
 947 |       05212 Pancreatic cancer [PATH:hsa05212]
 948 | 
 949 |       05214 Glioma [PATH:hsa05214]
 950 | 
 951 |       05216 Thyroid cancer [PATH:hsa05216]
 952 | 
 953 |       05221 Acute myeloid leukemia [PATH:hsa05221]
 954 | 
 955 |       05220 Chronic myeloid leukemia [PATH:hsa05220]
 956 | 
 957 |       05217 Basal cell carcinoma [PATH:hsa05217]
 958 | 
 959 |       05218 Melanoma [PATH:hsa05218]
 960 | 
 961 |       05211 Renal cell carcinoma [PATH:hsa05211]
 962 | 
 963 |       05219 Bladder cancer [PATH:hsa05219]
 964 | 
 965 |       05215 Prostate cancer [PATH:hsa05215]
 966 | 
 967 |       05213 Endometrial cancer [PATH:hsa05213]
 968 | 
 969 |       05222 Small cell lung cancer [PATH:hsa05222]
 970 | 
 971 |       05223 Non-small cell lung cancer [PATH:hsa05223]
 972 | 
 973 |  	
 974 |     Immune diseases
 975 | 
 976 |       05310 Asthma [PATH:hsa05310]
 977 | 
 978 |       05322 Systemic lupus erythematosus [PATH:hsa05322]
 979 | 
 980 |       05323 Rheumatoid arthritis [PATH:hsa05323]
 981 | 
 982 |       05320 Autoimmune thyroid disease [PATH:hsa05320]
 983 | 
 984 |       05321 Inflammatory bowel disease (IBD) [PATH:hsa05321]
 985 | 
 986 |       05330 Allograft rejection [PATH:hsa05330]
 987 | 
 988 |       05332 Graft-versus-host disease [PATH:hsa05332]
 989 | 
 990 |       05340 Primary immunodeficiency [PATH:hsa05340]
 991 | 
 992 |  	
 993 |     Neurodegenerative diseases
 994 | 
 995 |       05010 Alzheimer's disease [PATH:hsa05010]
 996 | 
 997 |       05012 Parkinson's disease [PATH:hsa05012]
 998 | 
 999 |       05014 Amyotrophic lateral sclerosis (ALS) [PATH:hsa05014]
1000 | 
1001 |       05016 Huntington's disease [PATH:hsa05016]
1002 | 
1003 |       05020 Prion diseases [PATH:hsa05020]
1004 | 
1005 |  	
1006 |     Substance dependence
1007 | 
1008 |       05030 Cocaine addiction [PATH:hsa05030]
1009 | 
1010 |       05031 Amphetamine addiction [PATH:hsa05031]
1011 | 
1012 |       05032 Morphine addiction [PATH:hsa05032]
1013 | 
1014 |       05033 Nicotine addiction [PATH:hsa05033]
1015 | 
1016 |       05034 Alcoholism [PATH:hsa05034]
1017 | 
1018 |  	
1019 |     Cardiovascular diseases
1020 | 
1021 |       05410 Hypertrophic cardiomyopathy (HCM) [PATH:hsa05410]
1022 | 
1023 |       05412 Arrhythmogenic right ventricular cardiomyopathy (ARVC) [PATH:hsa05412]
1024 | 
1025 |       05414 Dilated cardiomyopathy (DCM) [PATH:hsa05414]
1026 | 
1027 |       05416 Viral myocarditis [PATH:hsa05416]
1028 | 
1029 |  	
1030 |     Endocrine and metabolic diseases
1031 | 
1032 |       04930 Type II diabetes mellitus [PATH:hsa04930]
1033 | 
1034 |       04940 Type I diabetes mellitus [PATH:hsa04940]
1035 | 
1036 |       04950 Maturity onset diabetes of the young [PATH:hsa04950]
1037 | 
1038 |       04932 Non-alcoholic fatty liver disease (NAFLD) [PATH:hsa04932]
1039 | 
1040 |       04931 Insulin resistance [PATH:hsa04931]
1041 | 
1042 |       04933 AGE-RAGE signaling pathway in diabetic complications [PATH:hsa04933]
1043 | 
1044 |  	
1045 |     Infectious diseases
1046 | 
1047 |       05110 Vibrio cholerae infection [PATH:hsa05110]
1048 | 
1049 |       05111 Vibrio cholerae pathogenic cycle
1050 |       05120 Epithelial cell signaling in Helicobacter pylori infection [PATH:hsa05120]
1051 | 
1052 |       05130 Pathogenic Escherichia coli infection [PATH:hsa05130]
1053 | 
1054 |       05132 Salmonella infection [PATH:hsa05132]
1055 | 
1056 |       05131 Shigellosis [PATH:hsa05131]
1057 | 
1058 |       05133 Pertussis [PATH:hsa05133]
1059 | 
1060 |       05134 Legionellosis [PATH:hsa05134]
1061 | 
1062 |       05150 Staphylococcus aureus infection [PATH:hsa05150]
1063 | 
1064 |       05152 Tuberculosis [PATH:hsa05152]
1065 | 
1066 |       05100 Bacterial invasion of epithelial cells [PATH:hsa05100]
1067 | 
1068 |       05166 HTLV-I infection [PATH:hsa05166]
1069 | 
1070 |       05162 Measles [PATH:hsa05162]
1071 | 
1072 |       05164 Influenza A [PATH:hsa05164]
1073 | 
1074 |       05161 Hepatitis B [PATH:hsa05161]
1075 | 
1076 |       05160 Hepatitis C [PATH:hsa05160]
1077 | 
1078 |       05168 Herpes simplex infection [PATH:hsa05168]
1079 | 
1080 |       05169 Epstein-Barr virus infection [PATH:hsa05169]
1081 | 
1082 |       05146 Amoebiasis [PATH:hsa05146]
1083 | 
1084 |       05144 Malaria [PATH:hsa05144]
1085 | 
1086 |       05145 Toxoplasmosis [PATH:hsa05145]
1087 | 
1088 |       05140 Leishmaniasis [PATH:hsa05140]
1089 | 
1090 |       05142 Chagas disease (American trypanosomiasis) [PATH:hsa05142]
1091 | 
1092 |       05143 African trypanosomiasis [PATH:hsa05143]
1093 | 
1094 |  	
1095 |     Drug resistance
1096 | 
1097 |       01501 beta-Lactam resistance
1098 |       01502 Vancomycin resistance
1099 |       01503 Cationic antimicrobial peptide (CAMP) resistance	
1100 | 
1101 | 


--------------------------------------------------------------------------------
/lncRNA-record.txt:
--------------------------------------------------------------------------------
  1 | Web tools really do get cited a lot: WEB-based GEne SeT AnaLysis Toolkit
  2 | http://bioinfo.vanderbilt.edu/webgestalt/
  3 | http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3692109/
  4 | J Wang - 2013 - Cited by 375 - Related articles
  5 | 
  6 | 
  7 | You can easily download the GENCODE annotation and filter transcripts by their number of exons; the command below lists the first five single-exon lncRNA transcripts.
  8 | curl -s "ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.long_noncoding_RNAs.gtf.gz" | 
  9 |     gunzip -c | 
 10 |     awk '($3=="exon") {print $12}' | 
 11 |     sort | uniq -c | 
 12 |     awk '($1 == 1) {print $2}' | head -n 5
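The pipeline above relies on `$12` being the transcript ID, which holds only while `transcript_id` is the second attribute on each line. A minimal sketch of the same filter that looks the attribute up by name, run on a made-up GTF so it works offline (the IDs `T1` to `T3` and all three function names are illustrative assumptions, not part of GENCODE):

```shell
# Toy GTF: T1 has two exons, T2 and T3 have one each (all IDs are made up).
toy_gtf() {
    printf 'chr1\tHAVANA\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr1\tHAVANA\texon\t400\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr2\tHAVANA\texon\t10\t90\t.\t-\t.\tgene_id "G2"; transcript_id "T2";\n'
    printf 'chr3\tHAVANA\texon\t5\t50\t.\t+\t.\tgene_id "G3"; transcript_id "T3";\n'
}

# Reads GTF on stdin; prints "<transcript_id><TAB><exon_count>", sorted by ID.
count_exons_per_transcript() {
    awk -F'\t' '$3 == "exon" {
        # Pull the ID out of the attribute column instead of trusting field order.
        if (match($9, /transcript_id "[^"]+"/)) {
            id = substr($9, RSTART + 15, RLENGTH - 16)
            n[id]++
        }
    }
    END { for (id in n) print id "\t" n[id] }' | sort
}

# Keeps only transcripts with exactly one exon.
mono_exonic() {
    count_exons_per_transcript | awk -F'\t' '$2 == 1 { print $1 }'
}

toy_gtf | mono_exonic
```

On a real file, `gunzip -c gencode.v21.long_noncoding_RNAs.gtf.gz | mono_exonic` would take the place of the toy input.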
 13 | All sequences can be downloaded; see The GENCODE v7 catalog of human long noncoding RNAs:
 14 | paper: http://genome.cshlp.org/content/22/9/1775.full
 15 | FTP address: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/ 
 16 | The latest GENCODE release at the time of these notes is v24: wget -c -r -np -k -L -A "*metadata*" ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/ 
 17 | Check the number of records in each file: for id in *gz; do printf '%s    %s\n' "$id" "$(zcat "$id" | wc -l)"; done
 18 | These counts can be compared with the statistics on the official site: http://www.gencodegenes.org/stats.html
 19 | gencode.v24.metadata.Annotation_remark.gz    40879
 20 | gencode.v24.metadata.EntrezGene.gz    170466
 21 | gencode.v24.metadata.Exon_supporting_feature.gz    19193542
 22 | gencode.v24.metadata.Gene_source.gz    66206
 23 | gencode.v24.metadata.HGNC.gz    182831
 24 | gencode.v24.metadata.PDB.gz    94547
 25 | gencode.v24.metadata.PolyA_feature.gz    84652
 26 | gencode.v24.metadata.Pubmed_id.gz    209094
 27 | gencode.v24.metadata.RefSeq.gz    75365
 28 | gencode.v24.metadata.Selenocysteine.gz    119
 29 | gencode.v24.metadata.SwissProt.gz    45067
 30 | gencode.v24.metadata.Transcript_source.gz    217202
 31 | gencode.v24.metadata.Transcript_supporting_feature.gz    87375
 32 | gencode.v24.metadata.TrEMBL.gz    61924
 33 | 
 34 | You can also download all the GTF files:
 35 | wget -c -r -np -nd -k -L -A "*gtf.gz" ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/ 
 36 | 
 37 | Given a gene's ENSG ID, you can go directly to http://exac.broadinstitute.org/gene/ENSG00000236915 to view its information
 38 | 
 39 | You can also download the reference transcriptome and reference proteome; here I again use hg19 as the example:
 40 | ## ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh37_mapping/gencode.v24lift37.transcripts.fa.gz
 41 | ## ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh37_mapping/gencode.v24lift37.lncRNA_transcripts.fa.gz 
 42 | ## ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh37_mapping/gencode.v24lift37.pc_transcripts.fa.gz
 43 | 
 44 | 
 45 | There are also databases maintained by other groups:
 46 | http://www.lncipedia.org/download
 47 | http://genome.igib.res.in/lncRNome/
 48 | 
 49 | 
 50 | lncRNA analysis from microarray data: http://pubmedcentralcanada.ca/pmcc/articles/PMC3691033/
 51 | Dr. Zhen-Yu Zhang, Department of Gastroenterology, Nanjing First Hospital, Nanjing Medical University
 52 | 
 53 | Human exon arrays for gastric cancer and normal adjacent tissue were downloaded from the GEO. Two data sets were included: GSE27342 and GSE33335. 
 54 | GSE27342 was used as an experimental set to discover differentially expressed lncRNAs in gastric cancer while GSE33335 was used as a validation set.
 55 | GSE27342 consisted of 80 paired gastric cancer and normal adjacent tissue, including 4 stage I, 7 stage II, 54 stage III and 7 stage IV
 56 | GSE33335 consisted of 25 paired gastric cancer and normal adjacent tissue obtained from the tissue bank of Shanghai Biochip Center, Shanghai, China
 57 | 
 58 | The Affymetrix Human Exon 1.0 ST array consists of over 6.5 million individual probes designed along the entire length of the gene as opposed to just the 3’ end, 
 59 | providing a unique platform for mining lncRNA profiles
 60 | We identified 136053 probes from the Affymetrix Human Exon 1.0 ST array uniquely mapping to lncRNAs at the gene level.
 61 | The method is as follows: 
 62 | The probe sequences of the human exon array were downloaded from the Affymetrix website (http://www.affymetrix.com/Auth/analysis/downloads/na25/wtexon/HuEx-1_0-st-v2.probe.tab.zip) 
 63 | and aligned to the sequences of protein-coding and non-coding transcripts using BLAST-2.2.26+
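One way to turn such BLAST output into a probe list, sketched on made-up hits. It assumes tabular output (`-outfmt 6`), 25-nt probes, and subject IDs prefixed with `lnc|` or `pc|` to mark the transcript class. The prefixes, the perfect-match rule, and the function names are assumptions for illustration, not the criteria reported in the paper.

```shell
# Made-up BLAST tabular hits (outfmt 6 columns: qseqid sseqid pident length ...).
toy_hits() {
    printf 'probe1\tlnc|L1\t100.000\t25\t0\t0\t1\t25\t301\t325\t1e-07\t50.1\n'
    printf 'probe2\tlnc|L2\t100.000\t25\t0\t0\t1\t25\t11\t35\t1e-07\t50.1\n'
    printf 'probe2\tpc|P1\t100.000\t25\t0\t0\t1\t25\t77\t101\t1e-07\t50.1\n'
    printf 'probe3\tpc|P2\t100.000\t25\t0\t0\t1\t25\t5\t29\t1e-07\t50.1\n'
    printf 'probe4\tlnc|L3\t96.000\t25\t1\t0\t1\t25\t40\t64\t1e-04\t42.1\n'
}

# Keeps probes whose perfect, full-length hits fall only on lncRNA transcripts.
lncrna_specific_probes() {
    awk -F'\t' -v len=25 '
        $3 == 100 && $4 == len {
            if ($2 ~ /^lnc\|/) lnc[$1] = 1      # perfect hit on an lncRNA
            else               other[$1] = 1    # perfect hit on anything else
        }
        END { for (p in lnc) if (!(p in other)) print p }' | sort
}

toy_hits | lncrna_specific_probes
```

Here `probe2` is dropped because it also matches a protein-coding transcript, and `probe4` because its match is imperfect.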
 64 | 
 65 | These probes correspond to 9294 lncRNAs, covering nearly 76% of the GENCODE lncRNA data set.
 66 | First, analyze the GSE27342 data to find differentially expressed lncRNAs (genome-scale transcriptomic analyses on 80 paired gastric cancer/reference tissues)
 67 | The CEL files were processed by Affymetrix Power Tools for background correction, normalization, and summarizations with RMA algorithm
 68 | Using LIMMA with an adjusted P-value of less than 0.01 as a threshold, we identified 88 lncRNAs(gastric cancer VS normal tissue)
 69 | 
 70 | Then use the GSE33335 data for validation (25 pairs of gastric tissues: gastric cancer tissues vs. matched adjacent noncancerous tissues.)
 71 | To independently validate our results, we conducted the same analysis on GSE33335 and found that 59% of the differentially expressed lncRNAs identified by above analysis showed significant expression changes (adjusted P < 0.01) with the same direction. 
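The validation rate described above (significant in the second data set and changed in the same direction) can be computed from two result tables. A minimal sketch on made-up values follows; the two-column and three-column layouts, the IDs, and the function name are assumptions, and real LIMMA output would first need to be exported in this form.

```shell
# $1: discovery table  "<id><TAB><logFC>"
# $2: validation table "<id><TAB><logFC><TAB><adj_p>"
validated_fraction() {
    awk -F'\t' -v cutoff=0.01 '
        NR == FNR { disc[$1] = $2; n++; next }            # first file: remember direction
        ($1 in disc) && $3 < cutoff && (disc[$1] * $2 > 0) { ok++ }   # same sign, significant
        END { printf "%d/%d (%.0f%%)\n", ok, n, 100 * ok / n }' "$1" "$2"
}

disc=$(mktemp); val=$(mktemp)
printf 'A\t1.5\nB\t-2.0\nC\t0.8\nD\t-1.1\n' > "$disc"
printf 'A\t1.2\t0.001\nB\t1.0\t0.005\nC\t0.5\t0.2\nD\t-0.9\t0.003\n' > "$val"

result=$(validated_fraction "$disc" "$val")
echo "$result"
rm -f "$disc" "$val"
```

In the toy data, B is significant but flips direction and C is not significant, so only A and D count as validated.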
 72 | 
 73 | http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE33335
 74 | http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse27342
 75 | 
 76 | pre-step:
 77 | http://www.lncrnablog.com/
 78 | 
 79 | step1:read the paper and get the workflow for lncRNA analysis 
 80 | ## http://www.sciencedirect.com/science/article/pii/S1934590913000982
 81 | ## Integration of Genome-wide Approaches Identifies lncRNAs of Adult Neural Stem Cells and Their Progeny In Vivo
 82 | We utilize complementary genome-wide techniques including RNA-seq, RNA CaptureSeq, and ChIP-seq to associate specific lncRNAs with neural cell types, developmental processes, and human disease states. 
 83 | By integrating data from chromatin state maps, custom microarrays, and FACS purification of the subventricular zone lineage, we stringently identify lncRNAs with potential roles in adult neurogenesis.
 84 | 
 85 | 
 86 | Advances in RNA-Seq have opened the way to unbiased and efficient assays of the transcriptome of any mammalian cell
 87 | Recent studies in mouse and human cells have mostly focused on using RNA-Seq to study known genes
 88 | and have depended on existing annotations. 
 89 | They were thus of limited utility for discovering the complete gene structure of lincRNAs or other noncoding transcripts.
 90 | 
 91 | Because lncRNAs exhibit tissue-specific expression,
 92 | previous mouse lncRNA databases
 93 | were not likely comprehensive for lncRNAs involved in adult neurogenesis.
 94 | 
 95 | employing an RNA-seq and ab initio transcriptome reconstruction approach
 96 | RNA-seq was performed on the following 3 tissues or cell lines, and the data were then analyzed with Cufflinks:
 97 | SVZ (229 million reads),
 98 | OB (248 million reads), 
 99 | DG (157 million reads).
100 | Two public datasets (GSE20851) were also used. To broaden our lncRNA catalog, we also include
101 | embryonic stem cells (ESCs)
102 | ESC-derived neural progenitors cells (ESC-NPCs)
103 | The public data come from:
104 | About 800M reads in total:
105 | With this collection of over 800 million paired-end reads, we used Cufflinks to perform ab initio transcript assembly.
106 | 
107 | paper:2010-Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs
108 | 
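109 | The concordance for transcripts of protein-coding genes is very good: 93%
110 | For lncRNAs, however, a great many novel transcripts were found; in total this study found 8,992 lncRNAs encoded from 5,731 loci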
111 | There were
112 | 6,876 (76.5%) novel ones compared to RefSeq genes, 5,044
113 | (56.1%) were novel compared to UCSC known genes, and
114 | 3,680 (40.9%) were novel compared to all Ensembl genes.
115 | 
116 | Interestingly,
117 | 2,108 transcripts (23.4%) were uniquely recovered from
118 | our SVZ/OB/DG reads.
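As a quick arithmetic check, the percentages quoted above all follow from the total of 8,992 lncRNAs. A small sketch (the helper name `pct` is made up for this note):

```shell
total=8992   # lncRNAs in the catalog

# Prints k as a percentage of $total, rounded to one decimal place.
pct() {
    awk -v k="$1" -v n="$total" 'BEGIN { printf "%.1f\n", 100 * k / n }'
}

for k in 6876 5044 3680 2108; do
    printf '%s of %s = %s%%\n' "$k" "$total" "$(pct "$k")"
done
```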
119 | Moreover, checking the lncRNAs we found with the Coding Potential Calculator showed that 80% indeed have no ORF
120 | lncRNAs were expressed at lower levels than protein-coding genes (2.49-fold difference);
121 | their exons were less strongly conserved than protein-coding exons by PhastCons scores
122 | The transcription start sites of some lncRNAs lie very close to those of neighboring protein-coding genes: the TSS of 2,265 lncRNAs (25.2%) in our catalog were located within 5 kb of a protein-coding gene promoter
123 | GO analysis of these genes shows that they include homeodomain transcription factors, genes expressed specifically in brain tissue, and genes repressed by Polycomb repressive complex 2
124 | 
125 | Some lncRNAs also show expression that is highly correlated with that of their neighboring protein-coding genes!
126 | 
127 | Differential
128 | gene expression identified 1,621 genes enriched >2-fold in the
129 | SVZ cDNA library as compared to the cDNAs from cells in the
130 | adjacent nonneurogenic striatum (76.4 million reads).
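A >2-fold enrichment filter of this kind can be sketched with awk on a made-up expression table. The column layout (gene, SVZ value, striatum value), the pseudocount of 1, and the function names are assumptions for illustration; the paper's own differential expression procedure is not described in these notes.

```shell
# Made-up table: gene, SVZ expression, striatum expression.
toy_fpkm() {
    printf 'g1\t30\t10\n'
    printf 'g2\t10\t9\n'
    printf 'g3\t5\t0\n'
    printf 'g4\t3\t1\n'
}

# Prints genes whose (SVZ + pc) / (striatum + pc) ratio exceeds min_fold.
# The pseudocount pc avoids division by zero for genes absent from striatum.
enriched_genes() {
    awk -F'\t' -v min_fold=2 -v pc=1 '($2 + pc) / ($3 + pc) > min_fold { print $1 }'
}

toy_fpkm | enriched_genes
```

With these values `g4` sits exactly at 2-fold and is excluded, since the cutoff is strict.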
131 | 
132 | Next come the lncRNA microarray data for 6 regions, from another paper: GSE45282 
133 | To explore lncRNA expression patterns in multiple adult brain regions and embryonic forebrain development,
134 | adult cortex
135 | adult whole prefrontal cortex (PFC)
136 | adult preoptic
137 | area (POA)
138 | whole embryonic day 15 (E15) brain 
139 | specific regions of the developing E14.5 cortex (ventricular zone, intermediate zone, and cortical plate)
140 | Distance analysis of the expression values shows: region-specific and temporally related expression of both mRNAs and lncRNAs
141 | 
142 | The paper also mined 22 public datasets: Using RNA-seq data from 22 samples, we constructed transcript coexpression networks comprised of both mRNAs and lncRNAs.
143 | 
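144 | Because the novel lncRNAs found in this paper had not been annotated by earlier studies, the authors validated them with RNA CaptureSeq (section "RNA CaptureSeq Verifies SVZ lncRNA Expression and Identifies Novel Splice Isoforms")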
145 | RNA CaptureSeq probe library, we tiled across 100 MB of putative lncRNA loci and 30 MB of protein-coding regions as a control.
146 | This additionally uncovered rare lncRNAs and uncommon splice isoforms in the SVZ transcriptome, yielding more than 3,500 lncRNAs that could not be detected by the short-read sequencing technology
147 | A complete annotation of CaptureSeq-derived transcripts is available at http://neurosurgery.ucsf.edu/danlimlab/lncRNA.
148 | 
149 | The authors also studied the correlation between histone modifications and lncRNA expression:
150 | Our analysis of chromatin state maps and transcript expression suggest that histone modifications correlate with lncRNA expression in a manner similar to that of protein-coding genes.
151 | 
152 | All data are deposited in NCBI GEO under accession number GSE45282.
153 | 
154 | library strategy: RNA CaptureSeq
155 | Reads aligned using BLAT
156 | Newbler 2.6 used for de novo assembly of transcriptome
157 | FPKM values calculated using RSEM included in the Trinity package
158 | Genome_build: mm9
159 | 
160 | step2:download the raw data from NCBI-GEO-SRA database
161 | ## http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE45282
162 | cd ~ 
163 | mkdir lncRNA_test
164 | cd lncRNA_test
165 | mkdir paper_results
166 | cd    paper_results
167 | wget ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1100nnn/GSM1100702/suppl/GSM1100702_CaptureSeq_isotigs.tsv.gz
168 | wget ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1100nnn/GSM1100702/suppl/GSM1100702_target_regions.bed.gz
169 | 
170 | 
171 | # RNA-seq     GSE45278, 10 SRA files: ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP019/SRP019780
172 | ## SRR786689~SRR786698  ## for a summary of the data see: http://www.ncbi.nlm.nih.gov/sra/?term=SRP019780
173 | ####################### Sample sources: ####################### ####################### 
174 | RNA-seq (both paired end and single) 
175 | adult neurogenic niches- subventricular zone (SVZ) 
176 | olfactory bulb (OB)  
177 | dentate gyrus (DG) 
178 | control non-neurogenic tissue, striatum (STR) 
179 | ####################### ####################### ####################### 
180 | ##### The RNA-seq analysis also integrates a public dataset, which needs to be downloaded as well: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20851
181 | ##### The data volume is fairly large: ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP002/SRP002325
182 | 
183 | ####################### Based on the results of the RNA-seq analysis above
184 | # SVZ Capture GSE45277, 10 SRA files: ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX252/SRX252176
185 | ## SRR786699~SRR786708  ## just one sample: http://www.ncbi.nlm.nih.gov/sra/?term=SRX252176
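The two accession ranges noted above (SRR786689 to SRR786698 for RNA-seq, SRR786699 to SRR786708 for the SVZ capture) can be expanded into run lists for a download loop. A minimal sketch; the function name is made up, and the actual download step (for example with the SRA Toolkit) is left out so the snippet runs offline.

```shell
# Prints every SRR accession from $1 to $2, inclusive (numeric parts only).
srr_range() {
    i=$1
    while [ "$i" -le "$2" ]; do
        echo "SRR$i"
        i=$((i + 1))
    done
}

echo "# RNA-seq runs (GSE45278)"
srr_range 786689 786698
echo "# SVZ capture runs (GSE45277)"
srr_range 786699 786708
```

Each list contains 10 accessions, matching the 10 SRA files per series.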
186 | 
187 | step3:quality control for the sequence data  
188 | ## paper:2012- analysis of RNA-seq experiments with TopHat and Cufflinks : http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
189 | 
190 | ## The authors used tophat2+cufflinks+CummeRbund; here I replace them with HISAT+StringTie+ballgown
191 | ## https://speakerdeck.com/stephenturner/rna-seq-qc-and-data-analysis-using-the-tuxedo-suite
192 | ## pre-step:  install software: tophat2+cufflinks,HISAT+StringTie+ballgown,mm9  
193 | ### also :  kallisto+Sailfish +salmon 
194 | 
195 | ## Download and install tophat2
196 | ## http://www.ccb.jhu.edu/software/tophat/index.shtml   ## http://www.ccb.jhu.edu/software/tophat/manual.shtml
197 | cd ~/biosoft
198 | mkdir tophat2 &&  cd tophat2 
199 | wget http://www.ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz 
200 | tar zxvf tophat-2.1.1.Linux_x86_64.tar.gz
201 | 
202 | ## Download and install cufflinks
203 | ## http://cole-trapnell-lab.github.io/cufflinks/  ## http://cole-trapnell-lab.github.io/cufflinks/install/
204 | cd ~/biosoft
205 | mkdir cufflinks &&  cd cufflinks 
206 | wget http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz 
207 | tar zxvf cufflinks-2.2.1.Linux_x86_64.tar.gz
208 | 
209 | ## Download and install HISAT
210 | ## http://ccb.jhu.edu/software/hisat2/index.shtml　　##http://ccb.jhu.edu/software/hisat2/manual.shtml
211 | 
212 | cd ~/biosoft
213 | mkdir HISAT &&  cd HISAT 
214 | wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.0.4-Linux_x86_64.zip
215 | unzip hisat2-2.0.4-Linux_x86_64.zip
216 | 
217 | ## Download and install StringTie
218 | ## https://ccb.jhu.edu/software/stringtie/  ## https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual
219 | cd ~/biosoft
220 | mkdir StringTie &&  cd StringTie 
221 | wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-1.2.3.Linux_x86_64.tar.gz 
222 | tar zxvf stringtie-1.2.3.Linux_x86_64.tar.gz
223 | 
224 | ## cummeRbund  ballgown  ( R - Bioconductor packages; run the lines below in an R session )
225 | ## http://compbio.mit.edu/cummeRbund/
226 | ## https://github.com/alyssafrazee/ballgown
227 | source("http://bioconductor.org/biocLite.R")
228 | biocLite("ballgown")
229 | biocLite("cummeRbund")
230 | 
231 | ## RSEM+HTseq+Bedtools  https://github.com/bli25ucb/RSEM_tutorial
232 | ## RNA-Seq transcript quantification programs that work without alignment: kallisto+Sailfish+salmon
233 | ## the reference transcripts need to be downloaded separately
234 | ## ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh37_mapping/gencode.v24lift37.transcripts.fa.gz
235 | ## ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh37_mapping/gencode.v24lift37.lncRNA_transcripts.fa.gz 
236 | ## ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh37_mapping/gencode.v24lift37.pc_transcripts.fa.gz
237 | ## a FASTA file containing your reference transcripts and a (set of) FASTA/FASTQ file(s) containing your reads  
238 | ## Download and install kallisto
239 | ## https://pachterlab.github.io/kallisto/starting
240 | cd ~/biosoft
241 | mkdir kallisto &&  cd kallisto 
242 | wget https://github.com/pachterlab/kallisto/releases/download/v0.43.0/kallisto_linux-v0.43.0.tar.gz
243 | tar zxvf kallisto_linux-v0.43.0.tar.gz
244 | 
245 | ## Download and install Sailfish
246 | ## http://www.cs.cmu.edu/~ckingsf/software/sailfish/  ## 
247 | cd ~/biosoft
248 | mkdir Sailfish &&  cd Sailfish 
249 | wget   https://github.com/kingsfordgroup/sailfish/releases/download/v0.9.2/SailfishBeta-0.9.2_DebianSqueeze.tar.gz 
250 | tar zxvf SailfishBeta-0.9.2_DebianSqueeze.tar.gz
251 | 
252 | ## Download and install salmon
253 | ## http://salmon.readthedocs.io/en/latest/salmon.html ## 
254 | cd ~/biosoft
255 | mkdir salmon &&  cd salmon 
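256 | ## https://github.com/COMBINE-lab/salmon  (no download command was recorded here; fetch a release archive from this page first)
257 | ## then unpack it with: tar zxvf <downloaded salmon archive>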
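
Each install step above downloads an archive and then unpacks it, but the archives are not all tarballs (the HISAT2 download is a .zip). A small sketch of choosing the unpack command from the file extension follows; extract_cmd is a hypothetical name, and it only prints the command instead of running it.

```shell
# extract_cmd ARCHIVE: print the command that would unpack ARCHIVE (hypothetical helper)
extract_cmd() {
    case "$1" in
        *.tar.gz|*.tgz) echo "tar zxvf $1" ;;
        *.zip)          echo "unzip $1" ;;
        *)              echo "unknown archive type: $1" >&2; return 1 ;;
    esac
}

extract_cmd tophat-2.1.1.Linux_x86_64.tar.gz
extract_cmd hisat2-2.0.4-Linux_x86_64.zip
```

Printing the command first is a cheap way to catch a wrong tool before it fails halfway through an install.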
258 | 
259 | ## Download and install bowtie
260 | cd ~/biosoft
261 | mkdir bowtie &&  cd bowtie
262 | wget -O bowtie2-2.2.9-linux-x86_64.zip https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.9/bowtie2-2.2.9-linux-x86_64.zip/download
263 | #Length: 27073243 (26M) [application/octet-stream]
264 | ## Without -O, wget saves the file as "download" (the mistake I made the first time), and it then has to be renamed:
265 | ## mv download  bowtie2-2.2.9-linux-x86_64.zip
266 | unzip bowtie2-2.2.9-linux-x86_64.zip
267 |  
268 | mkdir -p ~/biosoft/bowtie/hg19_index 
269 | cd ~/biosoft/bowtie/hg19_index
270 | 
271 | # download hg19 chromosome fasta files
272 | wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
273 | # unzip and concatenate chromosome and contig fasta files
274 | tar zvfx chromFa.tar.gz
275 | cat *.fa > hg19.fa
276 | rm chr*.fa
277 | ##  ~/biosoft/bowtie/bowtie2-2.2.9/bowtie2-build  ~/biosoft/bowtie/hg19_index/hg19.fa  ~/biosoft/bowtie/hg19_index/hg19
278 | 
279 | 
280 | step4:mapping the reads to reference genome/transcriptome 
281 | 
282 | step5: de novo identification of lncRNA
283 | paper:Genome-wide computational identification and manual annotation  of human long noncoding RNA genes :http://rnajournal.cshlp.org/content/16/8/1478.short
284 | paper:A complete annotation of CaptureSeq-derived transcripts is available at http://neurosurgery.ucsf.edu/danlimlab/lncRNA.
285 | 
286 | 
287 | step6: count the expression level for each lncRNA
288 | 
289 | step7:find the differentially expressed  LncRNA  
290 | 
291 | step8: functional analysis for the lncRNA 
292 | paper:2016-Discovery and functional analysis of lncRNAs:  http://www.sciencedirect.com/science/article/pii/S1874939915002163 
293 | paper:2014-Genome-wide screening and functional analysis identify a large number of long noncoding RNAs involved in the sexual reproduction of rice:  https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0512-1
294 | Introduction: sequence-->expression-->function: http://www.exiqon.com/lncrna
295 | paper:2014-lncRNAtor-a comprehensive resource for functional investigation of long noncoding RNAs:http://lncrnator.ewha.ac.kr/index.htm
296 | paper:2014-Genome-Wide Analysis of Long Noncoding RNA (lncRNA) Expression in Hepatoblastoma Tissues : http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0085599
297 | paper:2015- Predicting the Functions of Long Noncoding RNAs Using RNA-Seq Based on Bayesian Network: http://www.hindawi.com/journals/bmri/2015/839590/
298 | paper:2016-Long Non-coding RNA in Neurons: New Players in Early Response to BDNF Stimulation : http://journal.frontiersin.org/article/10.3389/fnmol.2016.00015/full
299 | Figure 7: GO analysis of the biological function of lncRNA: http://www.nature.com/articles/srep21499/figures/7
300 | Figure 2. GO enrichment analysis of lncRNA targets. GO annotations of lncRNA targets categorized by (A) biological process, (B) cell component and (C) molecular function. :https://www.spandidos-publications.com/or/31/4/1613  ## http://www.oatext.com/Genome-wide-analysis-of-differentially-expressed-long-noncoding-RNAs-induced-by-low-shear-stress-in-human-umbilical-vein-endothelial-cells.php
301 | Figure 3. Functional classification of lncRNA  : https://www.spandidos-publications.com/ijo/45/2/619
302 | paper:2016-Expression profiles of long-noncoding RNAs in cutaneous squamous cell carcinoma. : http://www.futuremedicine.com/doi/abs/10.2217/epi-2015-0012
303 | 
304 | step9:lncRNA-mRNA co-expression network  
305 | 
306 | paper:2015-biomarkers-OV-Comprehensive analysis of lncRNA-mRNA co-expression patterns: http://www.nature.com/articles/srep17683
307 | paper:2016-Differential lncRNA-mRNA co-expression network analysis : http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4732855/
308 | paper:2014-Microarray Profiling and Co-Expression Network Analysis of Circulating lncRNAs and mRNAs :http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093388
309 | paper:2014-Long Noncoding RNA-EBIC Promotes Tumor Cell Invasion by Binding to EZH2 and Repressing E-Cadherin in Cervical Cancer : https://figshare.com/articles/_lncRNA_mRNA_co_expression_network_/1098837
310 | paper:2016-Microarray Analysis of lncRNA and mRNA Expression Profiles in Patients with Neuromyelitis Optica : http://link.springer.com/article/10.1007/s12035-016-9754-0
311 | paper:2015-LncRNA expression profiles reveal the co-expression network in human colorectal carcinoma  : http://www.ijcep.com/files/ijcep0017983.pdf
312 | 
313 | 
314 | 
315 | 
316 | step10: analysis of lncRNA-miRNA interactions
317 | 
318 | paper:2014-starBase-decoding miRNA-ceRNA, miRNA-ncRNA and protein–RNA interaction networks from large-scale CLIP-Seq data : http://nar.oxfordjournals.org/content/early/2013/11/30/nar.gkt1248.short
319 | paper:2014-An Integrated Analysis of miRNA, lncRNA, and mRNA Expression Profiles : http://www.hindawi.com/journals/bmri/2014/345605/abs/
320 | paper:2013-Systematic transcriptome wide analysis of lncRNA-miRNA interactions  : http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0053823
321 | Figure: Regulatory cancer network of lncRNA-miRNA interactions. : http://www.aimspress.com/article/10.3934/molsci.2016.2.104/fulltext.html
322 | paper:2014-Functional interactions among microRNAs and long noncoding RNAs :http://www.sciencedirect.com/science/article/pii/S1084952114001700
323 | paper:2014-NPInter v2.0-an updated database of ncRNA interactions: http://nar.oxfordjournals.org/content/42/D1/D104.short
324 | paper:2013-Long Noncoding RNAs-Related Diseases, Cancers, and Drugs: http://www.hindawi.com/journals/tswj/2013/943539/abs/
325 | 
326 | step11:  Histone Modifications and lncRNA Expression
327 | paper:2013-Panning for Long Noncoding RNAs : http://www.mdpi.com/2218-273X/3/1/226/htm
328 | 
329 | 
330 | 2015-Analysis of long non-coding RNAs highlights tissue-specific expression patterns and epigenetic profiles in normal and psoriatic skin : https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0570-4
331 | 2013-Predicting long non-coding RNAs using RNA sequencing : http://www.ncbi.nlm.nih.gov/pubmed/23541739
332 | 2014-Identification of prostate cancer LncRNAs by RNA-Seq ： http://www.ncbi.nlm.nih.gov/pubmed/25422238
333 | book: Identification of Disease-Related Genes by NGS: http://www.diss.fu-berlin.de/diss/servlets/MCRFileNodeServlet/FUDISS_derivate_000000015470/Dorn_Cornelia.diss2.pdf
334 | book: yeast function genomic : http://download.springer.com/static/pdf/150/bok%253A978-1-4939-3079-1.pdf?originUrl=http%3A%2F%2Flink.springer.com%2Fbook%2F10.1007%2F978-1-4939-3079-1&token2=exp=1467776101~acl=%2Fstatic%2Fpdf%2F150%2Fbok%25253A978-1-4939-3079-1.pdf%3ForiginUrl%3Dhttp%253A%252F%252Flink.springer.com%252Fbook%252F10.1007%252F978-1-4939-3079-1*~hmac=4aaa3f402c498fffa9c609d6ab14c14c0eba3b7862a526ecfddf30c0cf7fb81f
335 | RNA-seq workflow tophat+cufflinks+R: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/1100005/20130522GACDRNASeqandMethylation.pdf
336 | 
337 | mRNA and ncRNA (miRNA, lncRNA, snoRNA, etc) 
338 | Quantitative gene profiling of long noncoding RNAs with targeted RNA sequencing  : http://www.nature.com/nmeth/journal/v12/n4/full/nmeth.3321.html 
339 | Targeted RNA sequencing reveals the deep complexity of the human transcriptome	http://cole-trapnell-lab.github.io/pdfs/papers/mercer-capture-seq.pdf
340 | Targeted sequencing for gene discovery and quantification using RNA CaptureSeq http://www.ncbi.nlm.nih.gov/pubmed/24705597
341 | 2011: the RNA CaptureSeq technique appears: http://www.ebiotrade.com/newsf/2011-11/20111117145845614.htm
342 | 


--------------------------------------------------------------------------------
/public-mutation-database:
--------------------------------------------------------------------------------
  1 | Public mutation data
  2 |  
  3 | Almost all variant callers (SamTools, SOAPSNP, SOLiD BioScope, Illumina CASAVA, CG ASM-var, CG ASM-masterVAR, etc.) used a different file format for their output files. Later on, VCF (Variant Call Format) became the mainstream format for describing variants. It was originally developed and used by the 1000 Genomes Project, but its specification and extension are currently handled by the Global Alliance for Genomics and Health Data Working Group. See here for details on its format specification.
  4 | Ref: http://annovar.openbioinformatics.org/en/latest/articles/VCF/ 
  5 | database	samples	records	info
  6 | CGI69	69	27,603,800	AF,AC,AN,POS
  7 | ESP6500	6500	1,992,006	AF,AR,POS,SVM
  8 | ExAC-0.3	60,706	9,865,276	AF,AC,AN,HOM,HET,HEMI
  9 | 1000ph3v5		39,728,373	AF,AR,POS,SNPSOURCE
 10 | ICGC-r18		12,647,641	CN,RA(cancer type)
 11 | Hapmap-v3		1,462,978	AF,AR,POS
 12 | dbSNP144		146,638,225	
 13 | COSMIC-v74		2,153,353	
 14 | 
 15 | CGI69
 16 | VCF annotation built from the 69 individuals sequenced by Complete Genomics (CG); the annotation fields are the standard AF, AC, AN, POS.
 17 | ##fileformat=VCFv4.0
 18 | ##INFO=<ID=CGI69_NS,Number=1,Type=Integer,Description="Complete Genomics 69 Public Genomes, number of samples with fully called data">
 19 | ##INFO=<ID=CGI69_AF,Number=1,Type=Integer,Description="Complete Genomics 69 Public Genomes, allele frequency. A value L if the total allele called is < 30.">
 20 | ##INFO=<ID=CGI69_AC,Number=1,Type=Integer,Description="Complete Genomics 69 Public Genomes, allele count">
 21 | ##INFO=<ID=CGI69_AN,Number=1,Type=Integer,Description="Complete Genomics 69 Public Genomes, total allele called">
 22 | ##INFO=<ID=CGI69_POS,Number=0,Type=Flag,Description="Complete Genomics 69 Public Genomes, flag for matching the position and the reference allele but not the alternative allele">
 23 | #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
 24 | chr1    11014   .       G       A       .       .       CGI69_NS=0;CGI69_AC=1;CGI69_AN=1;CGI69_AF=L
 25 | chr1    11022   .       G       A       .       .       CGI69_NS=3;CGI69_AC=1;CGI69_AN=7;CGI69_AF=L
 26 | chr1    11075   .       A       G       .       .       CGI69_NS=0;CGI69_AC=1;CGI69_AN=1;CGI69_AF=L
 27 | Complete Genomics sequenced 69 individuals (69 standard, non-diseased samples) on its own platform and released the VCF data, which covers 27,603,800 sites in total.
 28 | ftp://ftp2.completegenomics.com/ 
 29 | http://www.completegenomics.com/public-data/ 
 30 | http://www.completegenomics.com/public-data/69-Genomes/
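
According to the header above, CGI69_AF is reported as L when fewer than 30 alleles were called, so for such records a frequency has to be recomputed from CGI69_AC and CGI69_AN. The sketch below does that for VCF lines read from standard input; cgi69_af is a hypothetical function name, and with so few called alleles the result is only a rough estimate.

```shell
# cgi69_af: read VCF lines on stdin, print CHROM, POS and AC/AN (hypothetical helper)
cgi69_af() {
    awk '
        /^#/ { next }                      # skip header lines
        {
            ac = ""; an = ""
            n = split($8, kv, ";")         # INFO is column 8
            for (i = 1; i <= n; i++) {
                split(kv[i], pair, "=")
                if (pair[1] == "CGI69_AC") ac = pair[2]
                if (pair[1] == "CGI69_AN") an = pair[2]
            }
            if (an + 0 > 0) printf "%s\t%s\t%.4f\n", $1, $2, ac / an
        }'
}

# Second example record from above: AC=1, AN=7
printf 'chr1\t11022\t.\tG\tA\t.\t.\tCGI69_NS=3;CGI69_AC=1;CGI69_AN=7;CGI69_AF=L\n' | cgi69_af
# prints: chr1	11022	0.1429
```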
 31 | ESP6500
 32 | NHLBI 
 33 | Exome Sequencing Project (ESP)
 34 | Exome Variant Server(EVS)
 35 | The current dataset comprises 2203 African-American and 4300 European-American unrelated individuals, totaling 6503 samples.
 36 | The samples included in the ESP6500 were selected from the populations listed on the "Home" tab. Information about these populations can be found through dbGaP. In general, ESP samples were selected to contain controls, the extremes of specific traits (LDL and blood pressure), and specific diseases (early onset myocardial infarction and early onset stroke), and lung diseases. Cohort or phenotype information about any particular individual CAN NOT BE RELEASED. The goal of the ESP dataset is to release the frequency counts of specific variants without regard to phenotype. 
 37 | None of the INDEL calls was validated. In general, the INDEL calls are less robust than the SNP calls and have a higher false positive rate. When applying the ESP data to research studies, users are advised to keep this difference in mind. 
 38 | All SNP data were called simultaneously at the University of Michigan (Abecasis Laboratory). The Michigan SNP calling pipeline is available here.
 39 | All INDEL data were analyzed at the Broad Institute (by the Genome Sequencing and Analysis group) using the GATK variation discovery pipeline following the guidelines in the GATK best practices v4. 
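 40 | The downloaded VCF file covers relatively few sites, only 1,992,006. The annotation fields are AF, AR (AR = AC/AN, written as AC:AN) and POS, plus an extra SVM flag, and the values are also broken down by the two populations.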
 41 | ##fileformat=VCFv4.0
 42 | ##INFO=<ID=EVS_AF,Number=1,Type=Float,Description="Exome Variant Server ESP6500-SIv2, allele frequency = allele count / total allele">
 43 | ##INFO=<ID=EVS_AR,Number=.,Type=String,Description="Exome Variant Server ESP6500-SIv2, allele ratio defined as allele count : total allele">
 44 | ##INFO=<ID=EVS_POS,Number=0,Type=Flag,Description="Exome Variant Server ESP6500-SIv2, flag for matching the position and the reference allele but not the alternative allele">
 45 | ##INFO=<ID=EVS_SVM,Number=0,Type=Flag,Description="Exome Variant Server ESP6500-SIv2, flag for failed SVM-based filter at threshold 0.3 (detailed http://evs.gs.washington.edu/EVS/HelpDescriptions.jsp?tab=SnpHelpTab#FilterStatus)">
 46 | #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
 47 | chr1    69428   .       T       G       .       .       EVS_AF=0.0306;EVS_AR=327:10670;EVS_EA_AR=313:6848;EVS_AA_AR=14:3822;EVS_AA_AF=0.0037;EVS_EA_AF=0.0457
 48 | chr1    69476   .       T       C       .       .       EVS_AF=0.0002;EVS_AR=2:10930;EVS_EA_AR=2:7022;EVS_AA_AR=0:3908;EVS_AA_AF=0.0000;EVS_EA_AF=0.0003
 49 | Links: http://evs.gs.washington.edu/EVS/
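
The EVS_*_AR fields store "allele count : total allele", so the matching EVS_*_AF values can be checked by dividing the two numbers. A minimal sketch, with evs_ar_freqs as a made-up function name, applied to the INFO string of the first example record above:

```shell
# evs_ar_freqs INFO: for every key ending in _AR, print the key and count/total (hypothetical helper)
evs_ar_freqs() {
    awk -v info="$1" 'BEGIN {
        n = split(info, kv, ";")
        for (i = 1; i <= n; i++) {
            split(kv[i], pair, "=")
            if (pair[1] ~ /_AR$/) {
                split(pair[2], c, ":")     # c[1] = allele count, c[2] = total alleles
                if (c[2] + 0 > 0) printf "%s\t%.4f\n", pair[1], c[1] / c[2]
            }
        }
    }'
}

evs_ar_freqs 'EVS_AF=0.0306;EVS_AR=327:10670;EVS_EA_AR=313:6848;EVS_AA_AR=14:3822;EVS_AA_AF=0.0037;EVS_EA_AF=0.0457'
# prints:
# EVS_AR	0.0306
# EVS_EA_AR	0.0457
# EVS_AA_AR	0.0037
```

The three ratios agree with EVS_AF, EVS_EA_AF and EVS_AA_AF in the same record.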
 50 | ExAC
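 51 | Covers 60,706 unrelated individuals (including ESP6500); the variant data has 9,865,276 records in total. The annotation fields are AF, AC, AN, HOM, HET, HEMI
 52 | (HEMI / HOM / HET = hemizygous / homozygous / heterozygous).
 53 | Each field is additionally broken down by population, so the annotation string becomes very long.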
 54 |  
 55 | EXAC_OTH_AC=0;EXAC_OTH_AN=116;EXAC_OTH_AF=0.000000;EXAC_OTH_HOM=0;EXAC_OTH_HET=0;EXAC_OTH_HEMI=0;EXAC_AMR_AC=0;EXAC_AMR_AN=102;EXAC_AMR_AF=0.000000;EXAC_AMR_HOM=0;EXAC_AMR_HET=0;EXAC_AMR_HEMI=0;EXAC_SAS_AC=0;EXAC_SAS_AN=6626;EXAC_SAS_AF=0.000000;EXAC_SAS_HOM=0;EXAC_SAS_HET=0;EXAC_SAS_HEMI=0;EXAC_AC=5;EXAC_AN=9922;EXAC_AF=0.000504;EXAC_HOM=0;EXAC_HET=5;EXAC_HEMI=0;EXAC_EAS_AC=0;EXAC_EAS_AN=148;EXAC_EAS_AF=0.000000;EXAC_EAS_HOM=0;EXAC_EAS_HET=0;EXAC_EAS_HEMI=0;EXAC_AFR_AC=0;EXAC_AFR_AN=414;EXAC_AFR_AF=0.000000;EXAC_AFR_HOM=0;EXAC_AFR_HET=0;EXAC_AFR_HEMI=0;EXAC_NFE_AC=5;EXAC_NFE_AN=2510;EXAC_NFE_AF=0.001992;EXAC_NFE_HOM=0;EXAC_NFE_HET=5;EXAC_NFE_HEMI=0;EXAC_FIN_AC=0;EXAC_FIN_AN=6;EXAC_FIN_AF=0.000000;EXAC_FIN_HOM=0;EXAC_FIN_HET=0;EXAC_FIN_HEMI=0
 56 | Links: http://exac.broadinstitute.org/ 
 57 | http://exac.broadinstitute.org/about 
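
Because every ExAC field is repeated per population, it helps to reduce the long INFO string to the populations that actually carry the allele. A rough sketch follows; exac_carrier_pops is a hypothetical name, and the input is a shortened excerpt of the record shown above.

```shell
# exac_carrier_pops INFO: list population codes whose EXAC_<POP>_AC is above zero (hypothetical helper)
exac_carrier_pops() {
    printf '%s\n' "$1" | tr ';' '\n' |
        awk -F= '$1 ~ /^EXAC_[A-Z]+_AC$/ && $2 + 0 > 0 {
            pop = $1
            sub(/^EXAC_/, "", pop)         # strip the database prefix
            sub(/_AC$/, "", pop)           # strip the field suffix
            print pop
        }'
}

# Excerpt of the example record (only some of its fields are repeated here)
exac_carrier_pops 'EXAC_SAS_AC=0;EXAC_SAS_AN=6626;EXAC_AC=5;EXAC_AN=9922;EXAC_AFR_AC=0;EXAC_NFE_AC=5;EXAC_NFE_AN=2510;EXAC_FIN_AC=0'
# prints: NFE
```

The overall EXAC_AC key has no population code, so the pattern skips it on purpose.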
 58 | Hapmap
 59 | This dataset records fairly simple information, just AF, AR (AR = AC/AN) and POS, with 1,462,978 records in total.
 60 | ##fileformat=VCFv4.0
 61 | ##INFO=<ID=HAP_AF,Number=1,Type=Float,Description="Hapmap 3.3 allele frequency = allele count / total allele">
 62 | ##INFO=<ID=HAP_AR,Number=.,Type=String,Description="Hapmap 3.3 allele ratio defined as allele count : total allele">
 63 | ##INFO=<ID=HAP_POS,Number=0,Type=Flag,Description="Hapmap 3.3 flag for matching the position and the reference allele but not the alternative allele">
 64 | #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
 65 | chr1    566875  .       C       T       .       .       HAP_AF=0.02369;HAP_AR=66:2786
 66 | chr1    567753  .       A       G       .       .       HAP_AF=0.00404;HAP_AR=11:2724
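 67 | 1000 Genomes Project phase 3
 68 | The 1000 Genomes Project phase 3 variant data is a bit more complex: besides AF, AR (AR = AC/AN), POS and SNPSOURCE, the values are also broken down by population.
 69 |  
 70 | 39,728,373 records in total.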
 71 | #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
 72 | chrM    3       .       T       C       .       .       1KG_AF=0.0009;1KG_AR=1:1073
 73 | chrM    10      .       T       C       .       .       1KG_AF=0.0019;1KG_AR=2:1074
 74 | ICGC mutation
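 75 | ICGC is an international consortium; TCGA is a US project whose data is included in the ICGC release (entries marked "from TCGA" below).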
 76 | The International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) cancer genomics projects
 77 | The downloaded VCF variant data records two kinds of values, with 12,647,641 records in total.
 78 | CN->  tumor sample count.
 79 | RA->   variant ratio: tumor sample count/ disease study tested sample count.
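 80 | Both values are broken down by cancer type and by the country where the samples were collected, and there are a great many cancer types: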
 81 | "ICGC r18, from TCGA, Uterine Corpus Endometrial Carcinoma- TCGA, US, 
 82 | "ICGC r18, Lung Cancer - CN, 
 83 | "ICGC r18, from TCGA, Ovarian Serous Cystadenocarcinoma - TCGA, US, 
 84 | "ICGC r18, from TCGA, Head and Neck Squamous Cell Carcinoma - TCGA, US, 
 85 | "ICGC r18, from TCGA, Cervical Squamous Cell Carcinoma - TCGA, US, 
 86 | "ICGC r18, Pediatric Brain Cancer - DE, 
 87 | "ICGC r18, Bladder Cancer - CN, 
 88 | "ICGC r18, Prostate Cancer - CA, 
 89 | "ICGC r18, from TCGA, Brain Glioblastoma Multiforme - TCGA, US, 
 90 | "ICGC r18, Neuroblastoma - TARGET, US, 
 91 | "ICGC r18, Pancreatic Cancer - CA, 
 92 | "ICGC r18, Liver Cancer - NCC, JP, 
 93 | "ICGC r18, Bladder Urothelial Cancer - TCGA, US, 
 94 | "ICGC r18, from TCGA, Gastric Adenocarcinoma - TCGA, US, 
 95 | "ICGC r18, Soft Tissue Cancer - Ewing sarcoma - FR, 
 96 | "ICGC r18, Acute Lymphoblastic Leukemia - TARGET, US, 
 97 | "ICGC r18, Pancreatic Cancer - IT, 
 98 | "ICGC r18, from TCGA, Brain Lower Grade Glioma - TCGA, US, 
 99 | "ICGC r18, Ovarian Cancer - AU, 
100 | "ICGC r18, Liver Cancer - RIKEN, JP, 
101 | "ICGC r18, Prostate Cancer - UK, 
102 | "ICGC r18, Oral Cancer - IN, 
103 | "ICGC r18, from TCGA, Kidney Renal Clear Cell Carcinoma - TCGA, US, 
104 | "ICGC r18, Early Onset Prostate Cancer - DE, 
105 | "ICGC r18, from TCGA, Rectum Adenocarcinoma - TCGA, US, 
106 | "ICGC r18, Colorectal Cancer - CN, 
107 | "ICGC r18, from TCGA, Lung Squamous Cell Carcinoma - TCGA, US, 
108 | "ICGC r18, from TCGA, Kidney Renal Papillary Cell Carcinoma - TCGA, US, 
109 | "ICGC r18, Benign Liver Tumour - FR, 
110 | "ICGC r18, Liver Cancer - Hepatocellular macronodules - FR, 
111 | "ICGC r18, Pancreatic Cancer - AU, 
112 | "ICGC r18, Pancreatic Cancer Endocrine neoplasms - AU, 
113 | "ICGC r18, from TCGA, Acute Myeloid Leukemia - TCGA, US, 
114 | "ICGC r18, Renal Cell Cancer - EU/FR, 
115 | "ICGC r18, from TCGA, Pancreatic Cancer - TCGA, US, 
116 | "ICGC r18, Malignant Lymphoma - DE, 
117 | "ICGC r18, from TCGA, Liver Hepatocellular carcinoma - TCGA, US, 
118 | "ICGC r18, Gastric Cancer - CN, 
119 | "ICGC r18, from TCGA, Breast Cancer - TCGA, US, 
120 | "ICGC r18, Liver Cancer - FR, 
121 | "ICGC r18, from TCGA, Lung Adenocarcinoma - TCGA, US, 
122 | "ICGC r18, from TCGA, Head and Neck Thyroid Carcinoma - TCGA, US, 
123 | "ICGC r18, from TCGA, Prostate Adenocarcinoma - TCGA, US, 
124 | "ICGC r18, from TCGA, Colon Adenocarcinoma - TCGA, US, 
125 | "ICGC r18, Bone Cancer - UK, 
126 | "ICGC r18, Acute Myeloid Leukemia - SK, 
127 | "ICGC r18, from TCGA, Skin Cutaneous melanoma - TCGA, US, 
128 | "ICGC r18, Breast Triple Negative/Lobular Cancer - UK, 
129 | "ICGC r18, Chronic Lymphocytic Leukemia - ES, 
130 | "ICGC r18, Renal clear cell carcinoma - CN, 
131 | "ICGC r18, Chronic Myeloid Disorders - UK, 
132 | "ICGC r18, Lung Cancer - SK, 
133 | "ICGC r18, Thyroid Cancer - SA, 
134 | "ICGC r18, Esophageal Cancer - CN, 
135 | "ICGC r18, Esophageal Adenocarcinoma - UK,
136 | The variant data looks like this:
137 | chrM    71      MU16879057      G       A       .       .       ICGC_TCGA_CN=1;ICGC_CN=1;ICGC_CN_PACA-AU=1;ICGC_RA_PACA-AU=0.00255
138 | chrM    72      MU16378494      T       C       .       .       ICGC_TCGA_CN=3;ICGC_CN=3;ICGC_CN_RECA-EU=3;ICGC_RA_RECA-EU=0.03158
139 | chrM    72      MU8785055       T       G       .       .       ICGC_TCGA_CN=1;ICGC_CN=1;ICGC_CN_PACA-AU=1;ICGC_RA_PACA-AU=0.00255
140 | chrM    114     MU16755427      C       A       .       .       ICGC_TCGA_CN=1;ICGC_CN=1;ICGC_CN_PACA-AU=1;ICGC_RA_PACA-AU=0.00255
141 | LINKS: https://icgc.org/
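
Since RA is defined above as tumor sample count / tested sample count, the number of samples tested in each study can be estimated as CN / RA. A small sketch with a made-up helper name, icgc_tested; RA is given to five decimals in the file, so the result is only approximate.

```shell
# icgc_tested CN RA: approximate number of tested samples, rounded to an integer (hypothetical helper)
icgc_tested() {
    awk -v cn="$1" -v ra="$2" 'BEGIN { if (ra + 0 > 0) printf "%.0f\n", cn / ra }'
}

icgc_tested 1 0.00255   # PACA-AU records above -> 392
icgc_tested 3 0.03158   # RECA-EU record above  -> 95
```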
142 | 
143 | dbSNP144 
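144 | The latest SNP release from NCBI at the time of writing, with 146,638,225 records in total.
145 | The fields carry little information. One is the rs ID: every SNP that appears in dbSNP144 has been assigned a number, so all of the more than 140 million records have an ID.
146 | Another field records whether the variant is a somatic mutation, but very few records actually have this filled in; the vast majority are UNKN: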
147 | UNKN	146576120
148 | BOTH	39911
149 | GERM	22185
150 | ##fileformat=VCFv4.0
151 | ##INFO=<ID=DBS_SBD,Number=.,Type=String,Description="dbSNP144, first dbSNP Build for RS to appear, RSid and dbSNP build pair values">
152 | ##INFO=<ID=DBS_STATUS,Number=.,Type=String,Description="dbSNP144, variant status: [SOMA- somatic mutation], [GERM- germline mutation], [BOTH- both somatic and germline], [UNKN- status is unknown]">
153 | ##INFO=<ID=DBS_VLD,Number=0,Type=Flag,Description="dbSNP144, flag as valid if variant has 2+ minor allele count based on frequency or genotype data">
154 | ##INFO=<ID=DBS_PM,Number=0,Type=Flag,Description="dbSNP144, flag if variant is pubmed cited or clinical">
155 | ##INFO=<ID=DBS_G5,Number=0,Type=Flag,Description="dbSNP144, flag if variant has >5% minor allele frequency in 1+ populations">
156 | ##INFO=<ID=DBS_OM,Number=0,Type=Flag,Description="dbSNP144, flag if variant has OMIM/OMIA">
157 | ##INFO=<ID=DBS_POS,Number=0,Type=Flag,Description="dbSNP144, flag for matching the position and the reference allele but not the alternative allele">
158 | #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
159 | chr1    10019   .       TAA     TA      .       .       DBS_SBD=rs775809821:144;DBS_STATUS=UNKN
160 | chr1    10055   .       TAA     TAAA    .       .       DBS_SBD=rs768019142:144;DBS_STATUS=UNKN
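
The status counts listed above can be reproduced by tallying DBS_STATUS over the INFO column. A minimal sketch, assuming the VCF is streamed on standard input; dbs_status_counts is a hypothetical name, and the demo uses only the two example records shown here.

```shell
# dbs_status_counts: tally DBS_STATUS values from VCF lines on stdin (hypothetical helper)
dbs_status_counts() {
    awk '
        /^#/ { next }                                      # skip header lines
        match($8, /DBS_STATUS=[A-Z]+/) {
            # "DBS_STATUS=" is 11 characters long; keep only the value after it
            status = substr($8, RSTART + 11, RLENGTH - 11)
            count[status]++
        }
        END { for (s in count) printf "%s\t%d\n", s, count[s] }' | sort
}

printf 'chr1\t10019\t.\tTAA\tTA\t.\t.\tDBS_SBD=rs775809821:144;DBS_STATUS=UNKN\nchr1\t10055\t.\tTAA\tTAAA\t.\t.\tDBS_SBD=rs768019142:144;DBS_STATUS=UNKN\n' | dbs_status_counts
# prints: UNKN	2
```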
161 | 
162 | CMC-cosmic-v74
163 | The latest version at the time of writing is v74, with 2,153,353 records in total.
164 |  
165 | It records a great deal of information:
166 | ##fileformat=VCFv4.0
167 | ##INFO=<ID=CMC_HGNCID,Number=.,Type=String,Description="COSMIC v74, HGNC gene id list.">
168 | ##INFO=<ID=CMC_GENEID,Number=.,Type=String,Description="COSMIC v74, gene symbol list.">
169 | ##INFO=<ID=CMC_SAMPLECNT,Number=1,Type=Integer,Description="COSMIC v74, sample count.">
170 | ##INFO=<ID=CMC_MTYPE,Number=.,Type=String,Description="COSMIC v74, mutation type that has been identified, typically substitution, insertion, deletion, compl
171 | ##INFO=<ID=CMC_AA,Number=.,Type=String,Description="COSMIC v74, list of amino acid change [accession number of contig/transcript]">
172 | ##INFO=<ID=CMC_STATUS,Number=.,Type=String,Description="COSMIC v74, mutation status: [SOMA- confirmed somatic mutation], [MSPL- somatic mutation seen at least
173 | ##INFO=<ID=CMC_SAMPLENAME,Number=.,Type=String,Description="COSMIC v74, sample name list.">
174 | ##INFO=<ID=CMC_SAMPLEID,Number=.,Type=String,Description="COSMIC v74, sample id list.">
175 | ##INFO=<ID=CMC_TUMORID,Number=.,Type=String,Description="COSMIC v74, tumor id list.">
176 | ##INFO=<ID=CMC_SITE,Number=.,Type=String,Description="COSMIC v74, tumor site list (primary:subtype).">
177 | ##INFO=<ID=CMC_HISTOLOGY,Number=.,Type=String,Description="COSMIC v74, tumor histology list (primary:subtype).">
178 | ##INFO=<ID=CMC_MUTID,Number=.,Type=String,Description="COSMIC v74, mutation id.">
179 | ##INFO=<ID=CMC_SAMPLESRC,Number=.,Type=String,Description="COSMIC v74, sample source.">
180 | ##INFO=<ID=CMC_TUMORORIGIN,Number=.,Type=String,Description="COSMIC v74, tumor origin.">
181 | ##INFO=<ID=CMC_POS,Number=0,Type=Flag,Description="COSMIC v74, flag for matching the position and the reference allele but not the alternative allele">
182 | ##INFO=<ID=CMC_PUBMED,Number=.,Type=String,Description="COSMIC v74, list of Pubmed reference">
183 | Among these fields, CMC_STATUS records whether the mutation is somatic.
184 |  
185 | Links: http://cancer.sanger.ac.uk/cosmic  
186 | http://cancer.sanger.ac.uk/cosmic/datasheets 
187 | 


--------------------------------------------------------------------------------
/translate_some_blogs:
--------------------------------------------------------------------------------
  1 | A farewell to bioinformatics
  2 | March 26, 2012
  3 | I'm leaving bioinformatics to go work at a software company with more technically ept people and for a lot more money. This seems like an opportune time to set forth my accumulated wisdom and thoughts on bioinformatics.
  4 | My attitude towards the subject after all my work in it can probably be best summarized thus: "Fuck you, bioinformatics. Eat shit and die."
  5 | Bioinformatics is an attempt to make molecular biology relevant to reality. All the molecular biologists, devoid of skills beyond those of a laboratory technician, cried out for the mathematicians and programmers to magically extract science from their mountain of shitty results.
  6 | And so the programmers descended and built giant databases where huge numbers of shitty results could be searched quickly. They wrote algorithms to organize shitty results into trees and make pretty graphs of them, and the molecular biologists carefully avoided telling the programmers the actual quality of the results. When it became obvious to everyone involved that a class of results was worthless, such as microarray data, there was a rush of handwaving about "not really quantitative, but we can draw qualitative conclusions" followed by a hasty switch to a new technique that had not yet been proved worthless.
  7 | And the databases grew, and everyone annotated their data by searching the databases, then submitted in turn. No one seems to have pointed out that this makes your database a reflection of your database, not a reflection of reality. Pull out an annotation in GenBank today and it's not very long odds that it's completely wrong.
  8 | Compare this with the most important result obtained by sequencing to date: Woese et al's discovery of the archaea. (Did you think I was going to say the human genome? Fuck off. That was a monument to the vanity of that god-bobbering asshole Francis Collins, not a science project.) They didn't sequence whole genomes, or even whole genes. They sequenced a small region of the 16S rRNA, and it was chosen after pilot experiments and careful thought. The conclusions didn't require giant computers, and they didn't require precise counting of the number of templates. They knew the limitations of their tools.
  9 | Then came clinical identification, done in combination with other assays, where a judicious bit of sequencing could resolve many ambiguities. Similarly, small scale sequencing has been an incredible boon to epidemiology. Indeed, its primary scientific use is in ecology. But how many molecular biologists do you know who know anything about ecology? I can count the ones I know on one hand.
 10 | And sequencing outside of ecology? Irene Pepperberg's work with Alex the parrot dwarfs the scientific contributions of all other sequencing to date put together.
 11 | This all seems an inauspicious beginning for a field. Anything so worthless should quickly shrivel up and die, right? Well, intentionally or not, bioinformatics found a way to survive: obfuscation. By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques and the slowest languages, by not publishing their algorithms and making their results impossible to replicate, the field managed to reduce its productivity by at least 90%, probably closer to 99%. Thus the thread of failures can be stretched out from years to decades, hidden by the cloak of incompetence.
 12 | And the rhetoric! The call for computational capacity, most of which is wasted! There are only two computationally difficult problems in bioinformatics, sequence alignment and phylogenetic tree construction. Most people would spend a few minutes thinking about what was really important before feeding data to an NP complete algorithm. I ran a full set of alignments last night using the exact algorithms, not heuristic approximations, in a virtual machine on my underpowered laptop yesterday afternoon, so we're not talking about truly hard problems. But no, the software is written to be inefficient, to use memory poorly, and the cry goes up for bigger, faster machines! When the machines are procured, even larger hunks of data are indiscriminately shoved through black box implementations of algorithms in hopes that meaning will emerge on the far side. It never does, but maybe with a bigger machine...
 13 | Fortunately for you, no one takes me seriously. The funding of molecular biology and bioinformatics is safe, protected by a wall of inbreeding, pointless jargon, and lies. So you all can rot in your computational shit heap. I'm gone.
 14 | 
 15 | Why do bioinformatics?
 16 | By Guillaume Filion, filed under software pollution, benchmark, bioinformatics. 
 17 | 
 18 | • 20 May 2015 •
 19 | I never planned to do bioinformatics. It just happened because I liked the time in front of my computer. Still, as every sane individual, I sometimes think that I could do something else with my life, and I wonder whether I am doing the right thing. On this topic, I recently came across the famous farewell to bioinformatics by Frederick J. Ross, which is worth reading, and of which the most emblematic quote is definitely the following.
 20 | My attitude towards the subject after all my work in it can probably be best summarized thus: Fuck you, bioinformatics. Eat shit and die.
 21 | There is nothing to agree or disagree in this quote, but Frederick gives further detail about his point of view in the post. In short, bioinformaticians are bad programmers, and community-level obfuscation maintains the illusion.
 22 | By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques and the slowest languages, by not publishing their algorithms and making their results impossible to replicate, the field managed to reduce its productivity by at least 90%, probably closer to 99%.
 23 | There are indeed many issues in the bioinformatics community and I am on Frederick’s side regarding file formats. For instance, I have huge respect for the maintainers of the BAM/SAM format, but here is a quote, straight from the documentation*.
 24 | You do not need to know anything about C to notice that the description does not match. At some point, the core storage format of BAM has changed (just that!) and the old documentation got mixed up with the new one. So much for a planetary standard.
 25 | But no discussion of bioinformatics nonsense would be complete without a benchmark section. In our last software article, we were asked to run our benchmark against an all-pairs algorithm called slidesort. The original benchmark of slidesort concealed two minor details: that it takes months to return, and that it is not an all-pairs algorithm. The email of the maintainers being obsolete, we had to put some effort into finding the authors to ask for explanations. The answer was that it was probably a bug. But “bug” is too polite, “software pollution” is more appropriate.
 26 | ... so why do bioinformatics?
 27 | The answer is simple: because it matters. Even though I deeply agree with Frederick, not everything boils down to working with skilful people. The impact of bioinformatics is unacknowledged but visible. How many discoveries started with a BLAST search? How many experiments were possible only because the human genome is sequenced? Besides, not every problem in bioinformatics is about memory footprint and CPU cycles; in some cases there are lives at stake. Choosing a treatment for cancer patients, deciding upon an abortion based on genotype data, initiating a vaccination campaign... and so much more.
 28 | Bioinformatics is biology, and it matters.
 29 | Is software a primary product of science?
 30 | When we were done writing Best Practices for Scientific Computing, we tried submitting it to a different high-profile journal than the one that ultimately accepted it (PLoS Biology, where it went on to become that journal's most highly read article of 2014). The response from the editor went something like this: "We recognize the importance of good engineering, but we regard writing software as equivalent to building a telescope - it's important to do it right, but we don't regard a process paper on how to build telescopes better as an intellectual contribution." (Disclaimer: I can't find the actual response, so this is a paraphrase, but it was definitely a "no" and for about that reason.)
 31 | Is scientific software like instrumentation?
 32 | When I think about scientific software as a part of science, I inevitably start with its similarities to building scientific instruments. New instrumentation and methods are absolutely essential to scientific progress, and it is clear that good engineering and methods development skills are incredibly helpful in research.
 33 | So, why did the editors at High Profile Journal bounce our paper? I infer that they drew exactly this parallel and thought no further.
 34 | But scientific software is only somewhat like new methods or instrumentation.
 35 | First, software can spread much faster and be used much more like a black box than most methods, and instrumentation inevitably involves either construction or companies that act as middlemen. With software, it's like you're shipping kits or plans for 3-D printing - something that is as close to immediately usable as it comes. If you're going to hand someone an immediately usable black box (and pitch it as such), I would argue that you should take a bit more care in building said black box.
 36 | Second, complexity in software scales much faster than in hardware (citation needed). This is partly due to human nature & a failure to think long-term, and partly due to the nature of software - software can quickly have many more moving parts than hardware, and at much less (short term) cost. Frankly, most software stacks resemble massive Rube Goldberg machines (read that link!). This means that different processes are needed here.
 37 | Third, at least in my field (biology), we are undergoing a transition to data intensive research, and software methods are becoming ever more important. There's no question that software is going to eat biology just like it's eating the rest of the world, and an increasingly large part of our primary scientific output in biology is going to hinge directly on computation (think: annotations. 'nuff said).
 38 | If we're going to build massively complex black boxes that underpin all of our science, surely that means that the process is worth studying intellectually?
 39 | Is scientific software a primary intellectual output of science?
 40 | No.
 41 | I think concluding that it is is an example of the logical fallacy "affirming the consequent" - or, "confusion of necessity and sufficiency". I'm not a logician, but I would phrase it like this (better phrasing welcome!) --
 42 | Good software is necessary for good science. Good science is an intellectual contribution. Therefore good software is an intellectual contribution.
 43 | Hopefully when phrased that way it's clear that it's nonsense.
 44 | I'm naming this "the fallacy of grad student hackers", because I feel like it's a common failure mode of grad students that are good at programming. I actually think it's a tremendously dangerous idea that is confounding a lot of the discussion around software contributions in science.
 45 | To illustrate this, I'll draw the analog to experimental labs: you may have people who are tremendously good at doing certain kinds of experiments (e.g. expert cloners, or PCR wizards, or micro-injection aficionados, or WMISH bravados) and with whom you can collaborate to rapidly advance your research. They can do things that you can't, and they can do them quickly and well! But these people often face dead ends in academia and end up as eterna-postdocs, because (for better or for worse) what is valued for first authorship and career progression is intellectual contribution, and doing experiments well is not sufficient to demonstrate an intellectual contribution. Very few people get career advancement in science by simply being very good at a technique, and I believe that this is OK.
 46 | Back to software - writing software may become necessary for much of science but I don't think it should ever be sufficient as a primary contribution. Worse, it can become (often becomes?) an engine of procrastination. Admittedly, that procrastination leads to things like IPython Notebook, so I don't want to ding it, but not all (or even most ;) grad students are like Fernando Perez.
 47 | Let's admit it, I'm just confused
 48 | This leaves us with a conundrum.
 49 | Software is clearly a force multiplier - "better software, better research!"
 50 | However, I don't think it can be considered a primary output of science. Dan Katz said, "Nobel prizes have been given for inventing instruments. I'm eagerly awaiting for one for inventing software [sic]" -- but I think he's wrong. Nobels have been given because of the insight enabled by inventing instruments, not for inventing instruments. (Corrections welcome!) So while I, too, eagerly await the explicit recognition that software can push scientific insight forward in biology, I am not holding my breath - I think it's going to look much more like the 2013 Chemistry Nobel, which is about general computational methodology. (My money here would be on a Nobel in Medicine for genome assembly methods, which should follow on separately from massively parallel sequencing methods and shotgun sequencing - maybe Venter, Church, and Myers/Pevzner deserve three different Nobels?)
 51 | Despite that, we do need to incentivize it, especially in biology but also more generally. Sean Eddy wrote AN AWESOME BLOG POST ON THIS TOPIC in 2010 (all caps because IT'S AWESOME AND WHY HAVEN'T WE MOVED FURTHER ON THIS <sob>). This is where DOIs for software usually come into play - hey, maybe we can make an analogy between software and papers! But I worry that this is a flawed analogy (for reasons outlined above) and will simply support the wrong idea that doing good hacking is sufficient for good science.
 52 | We also have a new problem - the so-called Big Data Brain Drain, in which it turns out that the skills that are needed for advancing science are also tremendously valuable in much more highly paid jobs -- much like physics number crunchers moving to finance, research professors in biology face a future where all our grad students go on to make more than us in tech. (Admittedly, this is only a problem if we think that more people clicking on ads is more important than basic research.) Jake Vanderplas (the author of the Big Data Brain Drain post) addressed potential solutions to this in Hacking Academia, about which I have mixed feelings. While I love both Jake and his blog post (platonically), there's a bit too much magical thinking in that post -- I don't see (m)any of those solutions getting much traction in academia.
 53 | The bottom line for me is that we need to figure it out, but I'm a bit stuck on practical suggestions. Natural selection may apply -- whoever figures this out in biology (basic research institutions and/or funding bodies) will have quite an edge in advancing biomedicine -- but natural selection works across multiple generations, and I could wish for something a bit faster. But I don't know. Maybe I'll bring it up at SciFoo this year - "Q: how can we kill off the old academic system faster?" :)
 54 | I'll leave you with two little stories.
 55 | The problem, illustrated
 56 | In 2009, we started working on what would ultimately become Pell et al., 2012. We developed a metric shit-ton of software (that's a scientific measure, folks) that included some pretty awesomely scalable sparse graph labeling approaches. The software worked OK for our problem, but was pretty brittle; I'm not sure whether or not our implementation of this partitioning approach is being used by anyone else, nor am I sure if it should be :).
 57 | However, the paper has been a pretty big hit by traditional scientific metrics! We got it into PNAS by talking about the data structure properties and linking physics, computer science, and biology together. It helped lead directly to Chikhi and Rizk (2013), and it has been cited a whole bunch of times for (I think) its theoretical contributions. Yay!
 58 | Nonetheless, the incredibly important and tricky details of scalably partitioning 10 bn node graphs were lost from that paper, and the software was not a big player, either. Meanwhile, Dr. Pell left academia and moved on to a big software company where (on his first day) he was earning quite a bit more than me (good on him! I'd like a 5% tithe, though, in the future :) :). Trust me when I say that this is a net loss to academia.
 59 | Summary: good theory, useful ideas, lousy software. Traditional success. Lousy outcomes.
 60 | A contrapositive
 61 | In 2011, we figured out that linear compression ratios for sequence data simply weren't going to cut it in the face of the continued rate of data generation, and we developed digital normalization, a deceptively simple idea that hasn't really been picked up by the theoreticians. Unlike the Pell work above, it's not theoretically well studied at all. Nonetheless, the preprint has a few dozen citations (because it's so darn useful) and the work is proving to be a good foundation for further research for our lab. Perhaps the truest measure of its memetic success is that it's been reimplemented by at least three different sequencing centers.
 62 | The software is highly used, I think, and many of our efforts on the khmer software have been aimed at making diginorm and downstream concepts more robust.
 63 | Summary: lousy theory, useful ideas, good software. Nontraditional success. Awesome outcomes.
 64 | Ways forward?
 65 | I simply don't know how to chart a course forward. My current instinct (see below) is to shift our current focus much more to theory and ideas and further away from software, largely because I simply don't see how to publish or fund "boring" things like software development. (Josh Bloom has an excellent blog post that relates to this particular issue: Novelty Squared)
 66 | I've been obsessing over these topics of software and scientific focus recently (see The three porridge bowls of scientific software development and Please destroy this software after publication. kthxbye) because I'm starting to write a renewal for khmer's funding. My preliminary specific aims look something like this:
 67 | Aim 1: Expand low memory and streaming approaches for biological sequence analysis.
 68 | Aim 2: Develop graph-based approaches for analyzing genomic variation.
 69 | Aim 3: Optimize and extend a general purpose graph analysis library.
 70 | Importantly, everything to do with software maintenance, support, and optimization is in Aim 3 and is in fact only a part of that aim. I'm not actually saddened by that, because I believe that software is only interesting because of the new science it enables. So I need to sell that to the NIH, and there software quality is (at best) a secondary consideration.
 71 | On the flip side, by my estimate 75% of our khmer funding is going to software maintenance, most significantly in paying down our technical debt. (In the grant I am proposing to decrease this to ~50%.)
 72 | I'm having trouble justifying this dichotomy mentally myself, and I can only imagine what the reviewers might think (although hopefully they will only glance at the budget ;).
 73 | So this highlights one conundrum: given my estimates and my priorities, how would you suggest I square these stated priorities with my funding allocations? And, in these matters, have I been wrong to focus on software quality, or should I have focused instead on accruing technical debt in the service of novel ideas and functionality? Inquiring minds want to know.
 74 | --titus
 75 | More on scientific software
 76 | Fri 24 April 2015
 77 | By C. Titus Brown
 78 | In science.
 79 | tags: sustainability, software citation
 80 | So I wrote this thing that got an awful lot of comments, many telling me that I'm just plain wrong. I think it's impossible to respond comprehensively :). But here are some responses.
 81 | What is, what could be, and what should be
 82 | In that blog post, I argued that software shouldn't be considered a primary output of scientific research. But I completely failed to articulate a distinction between what we do today with respect to scientific software, what we could be doing in the not-so-distant future, and what we should be doing. Worse, I mixed them all up!
 83 | ________________________________________
 84 | Peer reviewed publications and grants are the current coin of the realm. When we submit papers and grants for peer review, we have to deal with what those reviewers think right now. In bioinformatics, this largely means papers get evaluated on their perceived novelty and impact (even in impact-blind journals). Software papers are generally evaluated poorly on these metrics, so it's hard to publish bioinformatics software papers in visible places, and it's hard to argue in grants to the NIH (and most of the biology-focused NSF) that pure software development efforts are worthwhile. This is what is, and it makes it hard for methods+software research to get publications and funding.
 85 | ________________________________________
 86 | Assuming that you agree that methods+software research is important in bioinformatics, what could we be doing in the near future to boost the visibility of methods+software? Giving DOIs to software is one way to accrue credit to software that is highly used, but citations take a long time to pile up, reviewers won't know what to expect in terms of numbers (50 citations? is that a lot?), and my guess is that they will be poorly valued in important situations like funding and advancement. It's an honorable attempt to hack the system and software DOIs are great for other purposes, but I'm not optimistic about their near- or middle-term impact.
 87 | We could also start to articulate values and perspectives to guide reviewers and granting systems. And this is what I'd like to do. But first, let me rant a bit.
 88 | I think people underestimate the hidden mass in the scientific iceberg. Huge amounts of money are spent on research, and I would bet that there are at least twenty thousand PI-level researchers around the world in biology. In biology-related fields, any of these people may be called upon to review your grant or your paper, and their opinions will largely be, well, their own. To get published, funded, or promoted, you need to convince some committee containing these smart and opinionated researchers that what you're doing is both novel and impactful. To do that, you have to appeal largely to values and beliefs that they already hold.
 89 | Moreover, this set of researchers - largely made of people who have reached tenured professor status - sits on editorial boards, funding agency panels, and tenure and promotion committees. None of these boards and funding panels exist in a vacuum, and while to some extent program managers can push in certain directions, they are ultimately beholden to the priorities of the funding agency, which are (in the best case) channeled from senior scientists.
 90 | If you wonder why open access took so damn long to happen, this is one reason - the cultural "mass" of researchers that needs to shift their opinions is huge and unwieldy and resistant to change. And they are largely invisible, and subject to only limited persuasion.
 91 | One of the most valuable efforts we can make is to explore what we should be doing, and place it on a logical and sensical footing, and put it out there. For example, check out the CRA's memo on best practices in Promotion and Tenure of Interdisciplinary Faculty - great and thoughtful stuff, IMO. We need a bunch of well thought out opinions in this vein. What guidelines do we want to put in place for evaluating methods+software? How should we evaluate methods+software researchers for impact? When we fund software projects, what should we be looking for?
 92 | ________________________________________
 93 | And that brings me to what we should be doing, which is ultimately what I am most interested in. For example, I must admit to deep confusion about what a maturity model for bioinformatics software should look like; this feeds into funding requests, which ultimately feed into promotion and tenure. I don't know how to guide junior faculty in this area either; I have lots of opinions, but they're not well tested in the marketplace of ideas.
 94 | I and others are starting to have the opportunity to make the case for what we should be doing in review panels; what case should we make?
 95 | It is in this vein, then, that I am trying to figure out what value to place on software itself, and I'm interested in how to promote methods+software researchers and research. Neil Saunders had an interesting comment that I want to highlight here: he said,
 96 | My own feeling is that phrases like "significant intellectual contribution" are just unhelpful academic words,
 97 | I certainly agree that this is an imprecise concept, but I can guarantee that in the US, this is one of the three main questions for researchers at hiring, promotion, and tenure. (Funding opportunities and fit are my guesses for the other two.) So I would push on this point: researchers need to appear to have a clear intellectual contribution at every stage of the way, whatever that means. What it means is what I'm trying to explore.
 98 | Software is a tremendously important and critical part of the research endeavor
 99 | ...but it's not enough. That's my story, and I'm sticking to it :).
100 | I feel like the conversation got a little bit sidetracked by discussions of Nobel Prizes (mea partly culpa), and I want to discuss PhD theses instead. To get a PhD, you need to do some research; if you're a bioinformatics or biology grad student who is focused on methods+software, how much of that research can be software, and what else needs to be there?
101 | And here again I get to dip into my own personal history.
102 | I spent 9 years in graduate school. About 6 years into my PhD, I had a conversation with my advisor that went something like this:
103 | ________________________________________
104 | Me, age ~27 - "Hey, Eric, I've got ~two first-author papers, and another one or two coming, along with a bunch of papers. How about I defend my PhD on the basis of that work, and stick around to finish my experimental work as a postdoc?"
105 | Eric - blank look "All your papers are on computational methods. None of them count for your PhD."
106 | Me - "Uhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhmmmmmmmmmm..."
107 | (I did eventually graduate, but only after three more years of experiments.)
108 | In biology, we have to be able to defend our computational contributions in the face of an only slowly changing professoriate. And I'm OK with that, but I think we should make it clear up front.
109 | ________________________________________
110 | Since then, I've graduated three (soon to be five, I hope!) graduate students, one in biology and two in CS. In every single case, they've done a lot of hacking. And in every single case they've been asked to defend their intellectual contribution. This isn't just people targeting my students - I've sat on committees where students have produced masses of experimental data, and if they weren't prepared to defend their experimental design, their data interpretation, and the impact and significance of their data interpretation, they weren't ready to defend. This is a standard part of the PhD process at Caltech, at MSU, and presumably at UC Davis.
111 | So: to successfully receive a PhD, you should have to clearly articulate the problem you're tackling, its place in the scientific literature, the methods and experiments you're going to use, the data you got, the interpretation you place on that data, and the impact of your results on current scientific thinking. It's a pretty high bar, and one that I'm ok with.
112 | ________________________________________
113 | One of the several failure modes I see for graduate students is the one where graduate students spend a huge amount of time developing software and more or less assume that this work will lead to a PhD. Why would they be thinking that?
114 | •	Their advisor may not be particularly computational and may be giving poor guidance (which includes poorly explained criteria).
115 | •	Their advisor may be using them (intentionally or unintentionally) - effective programmers are hard to find.
116 | •	The grad student may be resistant to guidance.
117 | I ticked all of these as a graduate student, but I had the advantage of being a 3rd-generation academic, so I knew the score. (And I still ran into problems.) In my previous blog post, I angered and upset some people by my blunt words (I honestly didn't think "grad student hacker fallacy" was so rude ;( ) but it's a real problem that I confront regularly.
118 | Computational PhD students need to do what every scientific PhD student needs to do: clearly articulate their problem, place it in the scientific literature, define the computational methods and experiments they're going to do/have done, explain the data and their interpretation of it, and explore how it impacts science. Most of this involves things other than programming and running software! It's impossible to put down percent effort estimates that apply broadly, but my guess is that PhD students should spend at least a year understanding their results and interpreting and explaining their work.
119 | Conveniently, however, once you've done that for your PhD, you're ready to go in the academic world! These same criteria (expanded in scope) apply to getting a postdoc, publishing as a postdoc, getting a faculty position, applying for grants, and getting tenure. Moreover, I believe many of the same criteria apply broadly to research outside of academia (which is one reason I'm still strongly +1 on getting a PhD, no matter your ultimate goals).
120 | (Kyle Cranmer's comment on grad student efforts here was perfect.)
121 | Software as...
122 | As far as software being a primary product of research -- Konrad Hinsen nails it. It's not, but neither are papers, and I'm good with both statements :). Read his blog post for the full argument. The important bit is that very little stands on its own; there always needs to be communication effort around software, data, and methods.
123 | Ultimately, I learned a lot by admitting confusion! Dan Katz and Konrad Hinsen pointed out that software is communication, and Kai Blin drew a great analogy between software and experimental design. These are perspectives that I hadn't seen said so clearly before and they've really made me think differently; both are interesting and provocative analogies and I'm hoping that we can develop them further as a community.
124 | How do we change things?
125 | Kyle Cranmer and Rory Kirchner had a great comment chain on broken value systems and changing the system. I love the discussion, but I'm struggling with how to respond. My tentative and mildly unhappy conclusion is that I may have bought into the intellectual elitism of academia a bit too much (see: third generation academic), but this may also be how I've gotten where I am, so... mixed bag? (Rory made me feel old and dull, too, which is pretty cool in a masochistic kind of way.)
126 | One observation is that, in software, novelty is cheap. It's very, very easy to tweak something minorly, and fairly easy to publish it without generating any new understanding. How do we distinguish a future Heng Li or an Aaron Quinlan (who have enabled new science by cleanly solving "whole classes of common problems that you don't even have to think about anymore") from humdrum increment, and reward them properly in the earlier stages of their career? I don't know, but the answer has to be tied to advancing science, which is hard to measure on any short timescale. (Sean Eddy's blog post has the clearest view on solutions that I've yet seen.)
127 | Another observation (nicely articulated by Daisie Huang) is that (like open data) this is another game theoretic situation, where the authors of widely used software sink their time and energy into the community but don't necessarily gain wide recognition for their efforts. There's a fat middle ground of software that's reasonably well used but isn't samtools, and this ecosystem needs to be supported. This is much harder to argue - it's a larger body of software, it's less visible, and it's frankly much more expensive to support. (Carl Boettiger's comment is worth reading here.) The funding support isn't there, although that might change in the next decade. (This is the proximal challenge for me, since I place my own software, khmer, in this "fat middle ground"; how do I make a clear argument for funding?)
128 | Kyle Cranmer and others pointed to some success in "major instrumentation" and methods-based funding and career paths in physics (help, can't find link/tweets!). This is great, but I think it's also worth discussing the overall scale of things. Physics has a few really big and expensive instruments, with a few big questions, and with thousands of engineers devoted to them. Just in sequencing, biology has thousands (soon millions) of rather cheap instruments, devoted to many thousands of questions. If my prediction that software will "eat" this part of the world becomes true, we will need tens of thousands of data intensive biologists at a minimum, most working to some large extent on data analysis and software. I think the scale of the need here is simply much, much larger than in physics.
129 | I am supremely skeptical of the idea that universities as we currently conceive of them are the right home for stable, mature software development. We either need to change universities in the right way (super hard) or find other institutions (maybe easier). Here, the model to watch may well be the Center for Open Science, which produces the Open Science Framework (among others). My interpretation is that they are trying to merge scientific needs with the open source development model. (Tellingly, they are doing so largely with foundation funding; the federal funding agencies don't have good mechanisms for funding this kind of thing in biology, at least.) This may be the right model (or at least on the path towards one) for sustained software development in the biological sciences: have an institution focused on sustainability and quality, with a small diversity of missions, that can afford to spend the money to keep a number of good software engineers focused on those missions.
130 | ________________________________________
131 | Thanks, all, for the comments and discussions!
132 | --titus
133 | You write: "... novelty is cheap. It's very, very easy to tweak something minorly, and fairly easy to publish it without generating any new understanding."
134 | I suspect that if most researchers saw just that part of the quote out of context, they would readily agree with it but assume you were referring to academic publication in general rather than software in particular. I certainly would. I think you highlight issues that are both distinct to software and issues that are more general.
135 | As you've stated here, software isn't an intellectual contribution, but a means of communicating an intellectual contribution. Some people can use the software to do new research without understanding everything that went into it; just as some people will cite the results of a paper without replicating it from scratch. Others might dig into the raw code just as others dig into raw data; perhaps finding bugs or new directions.
136 | Some software is the result of a deep and direct intellectual discovery; a communication of a new algorithm or new theorem. Other software is instead primarily the communication of a pre-existing idea, communicated in a way that lets more people take advantage of that contribution. Yet this also happens in publications having nothing to do with software -- papers that communicate an idea that first appeared in some other field: a statistical inference method, or a way of thinking. These translational papers are frequently valuable, and can become widely cited and recognized.
137 | Any communication can be done well or poorly, and software is no exception. To me, the good software practices you have often advocated for (unit testing, modularity, documentation) play the same role as writing up a paper with carefully crafted language and figures. Do it well, and more people will be able to use your idea. Do it terribly, and even a great idea won't go far. Do it really well, and even a simple idea can go really far.
138 | Does that mean universities are the home for 'stable, mature software development?' Researchers rely every day on software developed at universities; just as we rely on research published at universities. That doesn't make it analogous to industry standards, any more than a scientific paper is immediately put into medical practice. Yet academics will continue to have ideas that can be communicated best in the medium of software (inferences, algorithms, analyses), and doing so with a minimum of typos and jargon is as important here as in writing.
139 | Is academic software that follows best practices really more widely used than software that doesn't? I have only anecdotes; but then the same is true for good writing style. There are some brilliant but impenetrable papers out there that still have huge impact (usually after someone writes a wrapper paper for the masses), just as there is widely used but terribly engineered software. There are even some bad ideas that are widely used because some software or some paper has made them broadly accessible.
140 | My point is that novelty, intellectual contributions, and impact of doing science vs writing (or coding) about it isn't unique to software. The battle to balance new ideas against communicating, testing, or translating existing ideas has always been part of science and we always need both, hence both doing research and publishing research have long been emphasized. Likewise, the debate on the value of technical skills spans all research and isn't isolated to undervaluing of research software engineers, and a PhD candidate should be able to articulate their intellectual contributions regardless of how much coding or pipetting they did.
141 | The real difference between software and papers is not in the intellectual contributions they communicate, or in the impact of that communication, but in the proxies we use to measure them. Generally speaking, publications are peer reviewed & software is not. Publications appear in stratified journals with a widely recognized social ranking, software does not. Without even waiting for citations to appear (itself a crude metric; intended for provenance and co-opted for status) we have a widely recognized way to measure those contributions, and make the case for funding activities that lead to those results. In the face of these proxies for quality, we don't always ask about what the intellectual contribution was, and what the ramifications have been. If software contributions competed for spots to be highlighted in the top journals while research papers appeared naked of these proxies like a big preprint archive, would this discussion be different?
142 | I don't know what the right model is or how it will work out. The big questions about supporting long term technical expertise, whether it's a research software engineer or the engineers "operating a 100-ton, 27-km superfluid helium system at 1.9 kelvin" are deep and difficult issues germane to doing big and hard science, and I think we need to tackle these issues as a community of researchers, not just as the sub-community of those who use software. I think our software community can do more to increase the recognition that software is having impact in academic research that is often equal to or beyond that of the best recognized papers. And I think in order for research software to win that reputation, it needs a bit of a similar spit and polish that we lavish on those papers. Perhaps that means we need proxies of quality, but it also means we need well written software.
143 | Software in scientific research
144 | In a recent blog post, Titus Brown asks if software is a primary product of science, and basically says “no” (but do read the post for the details). A blog-post length reply by Daniel Katz comes to the opposite conclusion (again, please read the post before continuing here). I left a short comment on Titus’ blog but also felt compelled to expand this into a blog post of its own – so here it is.
145 | Titus introduces a useful criterion for what “primary product of science” is: could you get a Nobel prize for it? As Dan comments, Nobel prizes in science are awarded for discoveries and inventions. There were no computers when Alfred Nobel set up his foundation, so we have to extrapolate this definition a bit to today’s situation. Is software like a discovery? Clearly not. Like an invention? Perhaps, but it doesn’t fit very well. Dan makes a comparison with scientific writing, i.e. papers, textbooks, etc. Scientific writing is the traditional way to communicate discoveries and inventions. But what scientists get Nobel prizes for is not the papers, but the work described therein. Papers are not primary products of science either, they are just a means of communication. There is a fairly good analogy between papers and their contents on one hand, and software and algorithms on the other hand. And algorithms are very well comparable to discoveries and inventions. Moreover, many of today’s scientific models are in fact expressed as algorithms. My conclusion is that algorithms clearly count as a primary product of science, but software doesn’t. Software is a means of communication, just like papers or textbooks.
146 | The analogy isn’t perfect, however. The big difference between a paper and a piece of software is that you can feed the latter into a computer to make it do something. Software is thus a scientific tool as well as a means of communication. In fact, today’s computational science gives more importance to the tool aspect than to the communication aspect. The main questions asked about scientific software are “What does it do?” and “How efficient is it?” When considering software as a means of communication, we would ask questions such as “Is it well-written, clear, elegant?”, “How general is the formulation?”, or “Can I use it as the basis for developing new science?”. These questions are beginning to be heard, in the context of the scientific software crisis and the need for reproducible research. But they are still second thoughts. We actually accept as normal that the scientific contents of software, i.e. the models implemented by it, are understandable only to software specialists, meaning that for the majority of users, the software is just a black box. Could you imagine this for a paper? “This paper is very obscure, but the people who wrote it are very smart, so let’s trust them and base our research on their conclusions.” Did you ever hear such a claim? Not me.
147 | Scientists haven’t yet fully grasped the particular status of software as both an information carrier and a tool. That may be one of the few characteristics they share with lawyers. The latter make a difference between “data” (including written text), which is covered by copyright, and “software”, which is covered by both copyright and licenses, and in some countries also by patents. Superficially, this makes sense, as it reflects the dual nature of software. It suffers, however, from two problems. First of all, the distinction exists only in the intention of the author, which is hard to pin down. Software is just data that can be interpreted as instructions for a computer. One could conceivably write some interpreter that turns previously generated data into software by executing it. Second, and that’s a problem for science, the licensing aspect of software is much more restrictive than the copyright aspect. If you describe an algorithm informally in a paper, you have to deal only with copyright. If you communicate it in executable form, you have to worry about licensing and patents as well, even if your main intention is more precise communication.
148 | I have written a detailed article about the problems resulting from the badly understood dual nature of scientific software, which I won’t repeat here. I have also proposed a solution, the development of formal languages for expressing complex scientific models, and I am experimenting with a concrete approach to get there. I mention this here mainly to motivate my conclusion:
149 | •	Q: Is software a primary product of science?
150 | •	A: No. But neither is a paper or a textbook.
151 | •	Q: Is software a means of communication for primary products of science?
152 | •	A: Yes, but it’s a bad one. We need something better.
153 | 
154 | To summarize:
155 | most bioinformatics software is not very good quality (#1), 
156 | most bioinformatics software is not built by a team (#2), 
157 | licensing is at best a minor component of what makes software widely used (#3), software should have an expiration date (#5), 
158 | most URLs are unstable (#6), 
159 | software should not be "idiot proof" (#7), 
160 | and it shouldn't matter whether you use a specific programming language (#8).
161 | 
162 | One blog has published a thorough response to this post:
163 | http://ivory.idyll.org/blog//2015-response-to-software-myths.html
164 | In particular, it tears Lior's fourth point to shreds.
165 | What surprises me most about Lior's post, though, is that he's describing the present situation rather accurately, but he's not angry about it. I'm angry, frustrated, and upset by it, and I really want to see a better future -- I'm in science, and biology, partly because I think it can have a real impact on society and health. Software is a key part of that.
166 | 
167 | 1. Somebody will build on your code.
168 | Nope. Ok, maybe not never but almost certainly not. There are many reasons for this. The primary reason in my view is that most bioinformatics software is of very poor quality (more on why this is the case in #2). Who wants to read junk software, let alone try to edit it or build on it? Most bioinformatics software is also targeted at specific applications. Biologists who use application specific software are typically not interested in developing or improving software because methods development is not their main interest and software development is not their best skill. In the computational biology areas I have worked in during the past 20 years (and I have reviewed/tested/examined/used hundreds or even thousands of programs) I can count the software programs that have been extended or developed by someone other than the author on a single hand. Software that has been built on/extended is typically of the framework kind (e.g. SAMtools being a notable example) but even then development of code by anyone other than the author is rare. For example, for the FSA alignment project we used HMMoC, a convenient compiler for HMMs, but has anyone ever built on the codebase? Doesn’t look like it. You may have been told by your PI that your software will take on a life of its own, like Linux, but the data suggests that is simply not going to happen. No, Gnu is Not Unix and your spliced aligner is not the Linux kernel. Most likely you alone will be the only user of your software, so at least put in some comments, because otherwise the first time you have to fix your own bug you won’t remember what you were doing in the code, and that is just sad.
169 | 2. You should have assembled a team to build your software.
170 | Nope. Although most corporate software these days is built by large teams working collaboratively, scientific software is different. I agree with James Taylor, who in the anatomy of successful computational biology software paper stated that “A lot of traditional software engineering is about how to build software effectively with large teams, whereas the way most scientific software is developed is (and should be) different. Scientific software is often developed by one or a handful of people.” In fact, I knew you were a graduate student because most bioinformatics software is written singlehandedly by graduate students (occasionally by postdocs). This is actually a problem (although not your fault!). Students such as yourself graduate, move on to other projects and labs, and stop maintaining (see #5), let alone developing their code. Many PIs insist on “owning” software their students wrote, hoping that new graduate students in their lab will develop projects of graduated students. But new students are reluctant to work on projects of others because in academia improvement of existing work garners much less credit than new work. After all, isn’t that why you were writing new software in the first place? I absolve you of your solitude, and encourage you to think about how you will create the right incentive structure for yourself to improve your software over a period of time that transcends your Ph.D. degree.
171 | 3. If you choose the right license more people will use and build on your program.
172 | Nope. People have tried all sorts of licenses but the evidence suggests the success of software (as measured by usage, or development of the software by the computational biology community) is not correlated with any particular license. One of the most widely used software suites in bioinformatics (if not the most widely used) is the UCSC genome browser and its associated tools. The software is not free, in that even though it is free for academic, non-profit and personal use, it is sold commercially. It would be difficult to argue that this has impacted its use, or retarded its development outside of UCSC. To the contrary, it is almost inconceivable that anyone working in genetics, genomics or bioinformatics has not used the UCSC browser (on a regular basis). In fact, I have, during my entire career, heard of only one single person who claims not to use the browser; this person is apparently boycotting it due to the license. As far as development of the software, it has almost certainly been hacked/modified/developed by many academics and companies since its initial release (e.g. even within my own group). In anatomy of successful computational biology software published in Nature Biotechnology two years ago, a list of “software for the ages” consists of programs that utilize a wide variety of licenses, including Boost, BSD, and GPL/LGPL. If there is any pattern it is that the most common are GPL/LGPL, although I suspect that if one looks at bioinformatics software as a whole those licenses are also the most common in failed software. The key to successful software, it appears, is for it to be useful and usable. Worry more about that and less about the license, because ultimately helping biologists and addressing problems in biomedicine might be more ethical than hoisting the “right” software license flag.
173 | 4. Making your software free for commercial use shows you are not against companies.
174 | Nope. The opposite is true. If you make your software free for commercial use, you are effectively creating a subsidy for companies, one that is funded by your university / your grants. You are a corporate hero! Congratulations! You have found a loophole for transferring scarce public money to the private sector. If you’ve licensed your software with BSD you’ve added another subsidy: a company using your software doesn’t have any reason to share their work with the academic community. There are two reasons why you might want to reconsider offering such subsidies. First, by denying yourself potential profits from sale of your software to industry, you are definitively removing any incentive for future development/maintenance of the software by yourself or future graduate students. Most bioinformatics software, when sold commercially, costs a few thousand dollars. This is a rounding error for companies developing cancer or other drugs at the cost of a billion dollars per drug and a tractable proposition even for startups, yet the money will make a real difference to you three years out from your Ph.D. when you’re earning a postdoc salary. A voice from the future tells you that you’ll appreciate the money, and it will help you remember that you really ought to fix that bug reported on GitHub two months ago. You will be part of the global improvement of bioinformatics software. And there is another reason to sell your software to companies: help your university incentivize more and better development of bioinformatics software. At most universities a majority of the royalties from software sales go to the institution (at UC Berkeley, where I work, it’s 2/3). Most schools, especially public universities, are struggling these days and have been for some time. Help them out in return for their investment in you; you’ll help generate more bioinformatics hires, and increase appreciation for your field.
In other words, although it is not always practical or necessary, when possible, please sell your software commercially.
175 | 5. You should maintain your software indefinitely.
176 | Nope. Someday you will die. Before that you will get a job, or not. Plan for your software to have a limited shelf-life, and act accordingly.
177 | 6. Your “stable URL” can exist forever.
178 | Nope. When I started out as a computational biologist in the late 1990s I worked on genome alignment. At the time I was excited about Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. This was a framework for specifying bioinformatics related dynamic programming algorithms, such as the Needleman-Wunsch or Smith-Waterman algorithms. The authors wrote that “A stable URL for Dynamite documentation, help and information is http://www.sanger.ac.uk/~birney/dynamite/”. Of course the URL is long gone, and by no fault of the authors. The website hosting model of the late 1990s is long extinct. To his credit, Ewan now hosts the Dynamite code on GitHub, following a welcome trend that is likely to extend the life of bioinformatics programs in the future. Will GitHub be around forever? We’ll see. But more importantly, software becomes extinct (or ought to) for reasons other than just 404 errors. For example, returning to sequence alignment, the ClustalW program of 1994 was surpassed in accuracy and speed by many other multiple alignment programs developed in the 2000s. Yet people kept using ClustalW anyway, perhaps because it felt like a “safe bet” with its many citations (eventually in 2011 ClustalW was updated to Clustal Omega). The lag in improving ClustalW resulted in a lot of poor alignments being utilized in genomics studies for a decade (no fault of the authors of ClustalW, but harmful nonetheless). I’ve started the habit of retiring my programs, via announcement on my website and PubMed. Please do the same when the time comes.
179 | 7. You should make your software “idiot proof”.
180 | Nope. Your users, hopefully biologists (and not other bioinformatics programmers benchmarking your program to show that they beat it) are not idiots. Listen to them. Back in 2004 Nicolas Bray and I published a webserver for the alignment program MAVID. Users were required to input FASTA files. But can you guess why we had to write a script called checkfasta? (hint: the most popular word processor was and is Microsoft Word). We could have stopped there and laughed at our users, but after much feedback we realized the real issue was that FASTA was not necessarily the appropriate input to begin with. Our users wanted to be able to directly input Genbank accessions for alignment, and eventually Nicolas Bray wrote the perl scripts to allow them to do that (the feature lives on here). The take home message for you is that you should deal with user requests wisely, and your time will be needed not only to fix bugs but to address use cases and requested features you never thought of in the first place. Be prepared to treat your users respectfully, and you and your software will benefit enormously.
181 | 8. You used the right programming language for the task.
182 | Nope. First it was perl, now it’s python. First it was MATLAB, now it’s R. First it was  C, then C++.  First it was multithreading now it’s Spark. There is always something new coming, which is yet another reason that almost certainly nobody is going to build on your code. By the time someone gets around to having to do something you may have worked on, there will be better ways. Therefore, the main thing is that your software should be written in a way that makes it easy to find and fix bugs, fast, and efficient (in terms of memory usage). If you can do that in Fortran great. In fact, in some fields not very far from bioinformatics, people do exactly that. My advice: stay away from Fortran (but do adopt some of the best practice advice offered here).
183 | 9. You should have read Lior Pachter’s blog post about the myths of bioinformatics software before starting your project.
184 | Nope. Lior Pachter was an author on the Vista paper describing a program for which the source code was available only “upon request”.
185 | New Insights into Human De Novo Mutations
186 | July 30, 2015 by Dan Koboldt
187 | De novo mutations — sequence variants that are present in a child but absent from both parents — are an important source of human genetic variation. I think it’s reasonable to say that most of the 3-4 million variants in any individual’s genome arose, once upon a time, as de novo mutations in his or her ancestors. In the past few years, whole-genome sequencing (WGS) studies performed in families (especially parent-child trios) have offered some revelations about de novo mutations and their role in human disease, notably that:
188 | o	The de novo mutation rate for humans is ~1.2e-08 per base pair per generation, which works out to around 38 mutations genome-wide per offspring
189 | o	95% of the global mutation rate is explained by paternal age (each year adding 1-2 mutations)
190 | o	As much as 2/3 of genetic diagnoses from clinical sequencing efforts are de novo mutations.
191 | A recent study in Nature Genetics provides the largest survey of de novo mutations to date. Laurent Francioli et al identified de novo mutations in 250 Dutch families that were sequenced to ~13x coverage as part of the Genome of the Netherlands (GoNL) project. Their findings confirm many of the observations from previous smaller studies, and offer some new insights into the patterns of de novo mutations throughout the human genome.
192 | Identification of de novo Mutations
193 | To make any global observations about de novo mutations, one generally needs unbiased whole-genome sequencing data for an individual and both parents. Even with those in hand, accurate identification of de novo mutations is challenging because they’re so exquisitely rare. Since the sequencing coverage in this study is a little bit light (13x, whereas most studies shoot for ~30x), I had some initial concerns about whether or not the mutation calls might hold up under scrutiny.
194 | Delving into the online methods, I learned that the samples underwent Illumina paired-end sequencing (2x91bp, insert size 500bp). Alignment and variant calling followed GATK best practices (v2), and the mutations were called with the trio-aware GATK PhaseByTransmission. Next, the authors used a machine learning classifier trained on 592 true positive and 1,630 false positive de novo calls that had been validated experimentally. The net result was 11,020 high-confidence mutations in the 269 children, with an estimated 92.2% accuracy.
195 | The numbers are about right: if 92.2% of the calls are real, that’s 10,160 true mutations, or ~37.7 mutations per child. That’s very close to the estimated ~38 per genome. In other words, without experimentally validating all 11,000 mutations (an expensive and laborious task), this is as good as it gets.
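The back-of-the-envelope check above can be reproduced in a few lines. This is only a sketch of the arithmetic in the text: the call count, accuracy, and number of children are the figures quoted above, while the 3.2 billion base pair genome size is borrowed from later in this post and the per-base-pair reading of the mutation rate is an assumption made for the comparison.

```python
# Figures quoted in the text above.
HIGH_CONFIDENCE_CALLS = 11_020
ESTIMATED_ACCURACY = 0.922
CHILDREN = 269

# Assumptions for the comparison (not taken from the paper's methods):
# the rate is read as per base pair per generation, and the genome size
# is the 3.2 billion base pairs mentioned later in this post.
MUTATION_RATE_PER_BP = 1.2e-8
GENOME_SIZE_BP = 3.2e9


def estimated_true_mutations(calls, accuracy):
    """Number of calls expected to be real, given the classifier accuracy."""
    return calls * accuracy


def mutations_per_child(true_mutations, children):
    """Average number of real de novo mutations per child."""
    return true_mutations / children


def expected_mutations_per_genome(rate_per_bp, genome_size_bp):
    """Expected de novo mutations per offspring from the per-base rate."""
    return rate_per_bp * genome_size_bp


if __name__ == "__main__":
    real = estimated_true_mutations(HIGH_CONFIDENCE_CALLS, ESTIMATED_ACCURACY)
    per_child = mutations_per_child(real, CHILDREN)
    expected = expected_mutations_per_genome(MUTATION_RATE_PER_BP, GENOME_SIZE_BP)
    print(f"estimated true mutations: {real:.0f}")
    print(f"observed per child:       {per_child:.2f}")
    print(f"expected per genome:      {expected:.1f}")
```

The observed per-child figure and the rate-based expectation land within about one mutation of each other, which is the agreement the paragraph above relies on.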
196 | Parent-of-Origin and Replication Timing
197 |  
198 | Credit: Francioli et al, Nature Genetics 2015
199 | The authors first examined whether the location of the observed mutations was correlated with any epigenetic variables. There was no significant correlation for most of the variables examined (chromatin accessibility, histone modifications, and recombination rate).  With a linear regression model, they noted a significant association between replication timing and paternal age: mutations in the offspring of younger fathers (<28 years old) were strongly enriched in late-replicating regions, whereas mutations in offspring of older fathers were not.
200 | To dig deeper, the authors looked at 2,621 mutations that could be unambiguously assigned to maternal or paternal origin. The method for this isn’t documented in the online methods, but presumably they looked for instances in which a mutation was in the same read or read pair as a variant unique to one parent. Notably, 1,991 of those origin-inferred mutations (76%) came from the father. After controlling for the number imbalance, the replication-timing-with-parent-age correlation was significant only for mutations of paternal origin.
201 | This makes a certain kind of sense, since the stem cells in the paternal germ line undergo continuous cell division throughout a man’s life, whereas a woman is born with all of the eggs she’ll ever have.
202 | The correlation between paternal age and replication timing is important from a reproductive health perspective, because late-replicating regions have lower gene density and expression levels than early ones. Since the mutations in offspring of younger fathers tend to occur in these regions, they’re less likely to have a functional impact. In support of this idea, on average, the offspring born to 40-year-old fathers had twice as many genic mutations as offspring born to 20-year-old fathers.
203 | In other words, mutations in the offspring of older fathers are not only more numerous, but also more likely to have functional consequences.
204 | Mutations in Functional Regions
205 | Notably, the de novo mutation rates in this study were higher in exonic regions regardless of the paternal age. Overall, 1.22% of mutations were exonic, an enrichment of 28.7% over simulated models of random mutation distribution. Mutations were also enriched in DNase I hypersensitive sites (DHSs), which represent likely regulatory regions. The source of this “functional enrichment” likely has to do with sequence context: mutations often occur at CpG dinucleotides, which are themselves more prevalent in exons and DHSs.
206 | Recent studies of somatic mutations in tumor cells revealed a fascinating phenomenon: a reduction in the mutation rate of highly transcribed regions, likely attributed to the fidelity conferred by transcription-coupled DNA repair mechanisms. In the current study of de novo mutations, however, the mutation rate in transcribed regions and DHSs did not appear to be reduced.
207 | The implication here might be that transcription-coupled repair has less of an impact on de novo mutations, though the authors note that their study was only powered to detect a substantial difference (>17%) in mutation rate. That’s understandable, because while the individuals examined here harbored ~40 mutations genome-wide, a tumor specimen might have tens of thousands of somatic mutations (i.e. much better power to detect subtle differences in mutation rate).
208 | Clustered de novo Mutations
209 | One of the most interesting observations in this study was a clustering effect of de novo mutations. If all things were random, given the size of the genome (3.2 billion base pairs) and the number of mutations per individual (~40), we expect them to be pretty far apart. As in, one every 80 million base pairs.
210 | Instead, the authors observed 78 instances in which there were “clusters” of 2-3 mutations within a 20kb window in the same individual. The 161 mutations involved showed no significant differences from the non-clustered mutations with regard to recombination rate (p=0.52) or replication timing (p=0.059), though I should point out that the latter might be approaching an interesting p-value.
211 | Interestingly, however, the clustered mutations exhibited an unusual mutational spectrum, with a strong enrichment for C->G transversions compared to non-clustered mutations (p=1.8e-13).
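To get a feel for how surprising these clusters are, the spacing estimate above can be extended into a rough expectation for close pairs. The sketch below is a simplification of my own, not the authors' method: it assumes mutations fall uniformly at random on a single linear genome of 3.2 billion base pairs, ignores chromosome boundaries and local variation in mutation rate, and counts pairs rather than clusters of 2-3.

```python
import math

GENOME_SIZE_BP = 3.2e9        # genome size quoted in the text
MUTATIONS_PER_CHILD = 40      # approximate count quoted in the text
WINDOW_BP = 20_000            # cluster window quoted in the text
CHILDREN = 269                # children in the study, as quoted earlier


def expected_spacing(genome_size_bp, n_mutations):
    """Average distance between mutations if they are spread evenly."""
    return genome_size_bp / n_mutations


def expected_close_pairs(genome_size_bp, n_mutations, window_bp):
    """Expected number of mutation pairs closer than window_bp.

    Uses the small-window approximation that two uniformly placed
    positions fall within window_bp of each other with probability
    2 * window_bp / genome_size_bp (edge effects ignored).
    """
    pairs = math.comb(n_mutations, 2)
    p_close = 2 * window_bp / genome_size_bp
    return pairs * p_close


if __name__ == "__main__":
    spacing = expected_spacing(GENOME_SIZE_BP, MUTATIONS_PER_CHILD)
    per_child = expected_close_pairs(GENOME_SIZE_BP, MUTATIONS_PER_CHILD, WINDOW_BP)
    print(f"expected spacing: {spacing:,.0f} bp")
    print(f"expected close pairs per child: {per_child:.5f}")
    print(f"expected close pairs in cohort: {per_child * CHILDREN:.2f}")
```

Under these assumptions a child is expected to carry roughly 0.01 close pairs, or only two to three across all 269 children, compared with the 78 clusters the authors report. The uniform model is crude, so this only illustrates the scale of the excess.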
212 |  
213 | Credit: Francioli et al, Nature Genetics 2015
214 | Based on the nucleotide context, the authors suggest that a new mutational mechanism may be at work involving cytosine deamination of single-stranded DNA (presumably during replication). I don’t have strong enough chemistry to understand the proposed mechanism, but agree that this unusual pattern merits some more investigation.
215 | References
216 | Francioli LC, Polak PP, Koren A, Menelaou A, Chun S, Renkens I, Genome of the Netherlands Consortium, van Duijn CM, Swertz M, Wijmenga C, van Ommen G, Slagboom PE, Boomsma DI, Ye K, Guryev V, Arndt PF, Kloosterman WP, de Bakker PI, & Sunyaev SR (2015). Genome-wide patterns and properties of de novo mutations in humans. Nature Genetics, 47(7), 822-826. PMID: 25985141
217 | 
218 | Why do bioinformatics?
219 | Web: http://blog.thegrandlocus.com/2015/05/why-do-bioinformatics
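220 | I never imagined that I would end up doing bioinformatics research; I simply liked the feeling of working in front of a computer, so I made this wise decision. Sometimes I think that there are many other things in my life still waiting for me to do, and I wonder whether my current choice is the best one. As it happens, I recently read an article by Frederick, "A Farewell to Bioinformatics", which discusses bioinformatics as a career and is well worth reading. The sentence that best captures the author's view is: from my experience working in bioinformatics, my attitude towards it is clear and can be summed up as: Fuck you, bioinformatics. Eat shit and die.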
221 | 
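222 | I will neither endorse nor reject this view. Frederick, however, explains his position in depth on his blog. In short, he believes that bioinformaticians are not good programmers and that most of their work is self-deception: they keep producing software nobody wants, define piles of assorted file formats, and do not publish their algorithms, so that others cannot reproduce their work. Together these cut the productivity of their work by 90%, and arguably 99% of it is wasted effort.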
223 | 
224 | The obstacles of bioinformatics
225 | Web: http://madhadron.com/posts/2012-03-26-a-farewell-to-bioinformatics.html
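226 | I have now left bioinformatics research for a software company, where there are many technically brilliant people and where I am also paid more. This seems like a suitable moment, at last, to say what I think of bioinformatics.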
227 | 
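228 | From my experience working in bioinformatics, my attitude towards it is clear and can be summed up as: Fuck you, bioinformatics. Eat shit and die.
229 | The role of bioinformatics should be to try to make the microscopic world of molecular biology easier to understand. All molecular biologists lack those skills from outside the laboratory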
230 | 
231 | 
232 | 
233 | 


--------------------------------------------------------------------------------