├── LICENSE
├── README.md
├── bin
│   └── Tutorial_Module4_ERCC_expression.pl
├── manuscript
│   ├── figures
│   │   ├── Fig1.eps
│   │   ├── Fig2.eps
│   │   ├── Fig3.eps
│   │   ├── Fig4.eps
│   │   ├── Fig5.eps
│   │   ├── Fig6.eps
│   │   ├── Figure1.ai
│   │   ├── Figure1.pdf
│   │   ├── Figure1.png
│   │   ├── Figure2.ai
│   │   ├── Figure2.pdf
│   │   ├── Figure2.png
│   │   ├── Figure3.ai
│   │   ├── Figure3.pdf
│   │   ├── Figure3.png
│   │   ├── Figure4.ai
│   │   ├── Figure4.pdf
│   │   ├── Figure4.png
│   │   ├── Figure5.ai
│   │   ├── Figure5.pdf
│   │   ├── Figure5.png
│   │   ├── Figure6.ai
│   │   ├── Figure6.pdf
│   │   ├── Figure6.png
│   │   ├── README.md
│   │   ├── StrikingImage.ai
│   │   ├── StrikingImage.eps
│   │   ├── StrikingImage.pdf
│   │   ├── StrikingImage.png
│   │   └── raw_materials
│   │       ├── Fig6-pcb-edits.eps
│   │       ├── Figure2
│   │       │   ├── DSC_0267_cropped.jpg
│   │       │   ├── Figure_2.ai
│   │       │   ├── Tumor-normal.ai
│   │       │   ├── cDNA.ai
│   │       │   ├── flow-cell.ai
│   │       │   ├── flowcell-600dpi.jpg
│   │       │   ├── mapping.ai
│   │       │   ├── rna.ai
│   │       │   ├── tissue.ai
│   │       │   ├── total_rna.ai
│   │       │   └── unmapped_reads.ai
│   │       ├── Figure5.ppt
│   │       ├── Figure5_new_mockup.jpg
│   │       ├── Figure6C_vector.ai
│   │       ├── Figure6_vector.ai
│   │       ├── RNA-Seq-alignment.png
│   │       ├── Stranded_vs_Unstranded_SoftwareSettings.png
│   │       ├── igv_snapshot_strand.png
│   │       ├── igv_snapshot_strand2.png
│   │       └── igv_snapshot_strand3.png
│   └── supplementary_tables
│       ├── addUrls.pl
│       ├── supplementary_table_1.md
│       ├── supplementary_table_1_urls.md
│       ├── supplementary_table_2.md
│       ├── supplementary_table_2_urls.md
│       ├── supplementary_table_3.md
│       ├── supplementary_table_4.md
│       ├── supplementary_table_5.md
│       ├── supplementary_table_6.md
│       ├── supplementary_table_7.md
│       ├── supplementary_table_8.md
│       └── supplementary_table_9.md
├── scripts
│   ├── Igv_HCC1143_attributes.txt
│   ├── Run_batch_IGV_snapshots.txt
│   ├── Tutorial_Module4_ERCC_DE.R
│   ├── Tutorial_Module4_ERCC_DE.pdf
│   ├── Tutorial_Module4_ERCC_expression.R
│   ├── Tutorial_Module4_ERCC_expression.pdf
│   ├── Tutorial_Module4_ERCC_expression.pl
│   ├── Tutorial_Module4_Part2_cummeRbund.R
│   ├── Tutorial_Module4_Part2_cummeRbund_output.pdf
│   ├── Tutorial_Module4_Part3_Supplementary_R.R
│   ├── Tutorial_Module4_Part3_Supplementary_R_output.pdf
│   └── Tutorial_Module4_Part4_edgeR.R
└── setup
    ├── .bashrc
    ├── preinstall.sh
    └── setup_mounts.sh
/LICENSE:
--------------------------------------------------------------------------------
1 | This material is made available under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license
2 |
3 | https://creativecommons.org/licenses/by-sa/4.0/legalcode
4 |
5 | Attribution-ShareAlike 4.0 International
6 |
7 | =======================================================================
8 |
9 | Creative Commons Corporation ("Creative Commons") is not a law firm and
10 | does not provide legal services or legal advice. Distribution of
11 | Creative Commons public licenses does not create a lawyer-client or
12 | other relationship. Creative Commons makes its licenses and related
13 | information available on an "as-is" basis. Creative Commons gives no
14 | warranties regarding its licenses, any material licensed under their
15 | terms and conditions, or any related information. Creative Commons
16 | disclaims all liability for damages resulting from their use to the
17 | fullest extent possible.
18 |
19 | Using Creative Commons Public Licenses
20 |
21 | Creative Commons public licenses provide a standard set of terms and
22 | conditions that creators and other rights holders may use to share
23 | original works of authorship and other material subject to copyright
24 | and certain other rights specified in the public license below. The
25 | following considerations are for informational purposes only, are not
26 | exhaustive, and do not form part of our licenses.
27 |
28 | Considerations for licensors: Our public licenses are
29 | intended for use by those authorized to give the public
30 | permission to use material in ways otherwise restricted by
31 | copyright and certain other rights. Our licenses are
32 | irrevocable. Licensors should read and understand the terms
33 | and conditions of the license they choose before applying it.
34 | Licensors should also secure all rights necessary before
35 | applying our licenses so that the public can reuse the
36 | material as expected. Licensors should clearly mark any
37 | material not subject to the license. This includes other CC-
38 | licensed material, or material used under an exception or
39 | limitation to copyright. More considerations for licensors:
40 | wiki.creativecommons.org/Considerations_for_licensors
41 |
42 | Considerations for the public: By using one of our public
43 | licenses, a licensor grants the public permission to use the
44 | licensed material under specified terms and conditions. If
45 | the licensor's permission is not necessary for any reason--for
46 | example, because of any applicable exception or limitation to
47 | copyright--then that use is not regulated by the license. Our
48 | licenses grant only permissions under copyright and certain
49 | other rights that a licensor has authority to grant. Use of
50 | the licensed material may still be restricted for other
51 | reasons, including because others have copyright or other
52 | rights in the material. A licensor may make special requests,
53 | such as asking that all changes be marked or described.
54 | Although not required by our licenses, you are encouraged to
55 | respect those requests where reasonable. More_considerations
56 | for the public:
57 | wiki.creativecommons.org/Considerations_for_licensees
58 |
59 | =======================================================================
60 |
61 | Creative Commons Attribution-ShareAlike 4.0 International Public
62 | License
63 |
64 | By exercising the Licensed Rights (defined below), You accept and agree
65 | to be bound by the terms and conditions of this Creative Commons
66 | Attribution-ShareAlike 4.0 International Public License ("Public
67 | License"). To the extent this Public License may be interpreted as a
68 | contract, You are granted the Licensed Rights in consideration of Your
69 | acceptance of these terms and conditions, and the Licensor grants You
70 | such rights in consideration of benefits the Licensor receives from
71 | making the Licensed Material available under these terms and
72 | conditions.
73 |
74 |
75 | Section 1 -- Definitions.
76 |
77 | a. Adapted Material means material subject to Copyright and Similar
78 | Rights that is derived from or based upon the Licensed Material
79 | and in which the Licensed Material is translated, altered,
80 | arranged, transformed, or otherwise modified in a manner requiring
81 | permission under the Copyright and Similar Rights held by the
82 | Licensor. For purposes of this Public License, where the Licensed
83 | Material is a musical work, performance, or sound recording,
84 | Adapted Material is always produced where the Licensed Material is
85 | synched in timed relation with a moving image.
86 |
87 | b. Adapter's License means the license You apply to Your Copyright
88 | and Similar Rights in Your contributions to Adapted Material in
89 | accordance with the terms and conditions of this Public License.
90 |
91 | c. BY-SA Compatible License means a license listed at
92 | creativecommons.org/compatiblelicenses, approved by Creative
93 | Commons as essentially the equivalent of this Public License.
94 |
95 | d. Copyright and Similar Rights means copyright and/or similar rights
96 | closely related to copyright including, without limitation,
97 | performance, broadcast, sound recording, and Sui Generis Database
98 | Rights, without regard to how the rights are labeled or
99 | categorized. For purposes of this Public License, the rights
100 | specified in Section 2(b)(1)-(2) are not Copyright and Similar
101 | Rights.
102 |
103 | e. Effective Technological Measures means those measures that, in the
104 | absence of proper authority, may not be circumvented under laws
105 | fulfilling obligations under Article 11 of the WIPO Copyright
106 | Treaty adopted on December 20, 1996, and/or similar international
107 | agreements.
108 |
109 | f. Exceptions and Limitations means fair use, fair dealing, and/or
110 | any other exception or limitation to Copyright and Similar Rights
111 | that applies to Your use of the Licensed Material.
112 |
113 | g. License Elements means the license attributes listed in the name
114 | of a Creative Commons Public License. The License Elements of this
115 | Public License are Attribution and ShareAlike.
116 |
117 | h. Licensed Material means the artistic or literary work, database,
118 | or other material to which the Licensor applied this Public
119 | License.
120 |
121 | i. Licensed Rights means the rights granted to You subject to the
122 | terms and conditions of this Public License, which are limited to
123 | all Copyright and Similar Rights that apply to Your use of the
124 | Licensed Material and that the Licensor has authority to license.
125 |
126 | j. Licensor means the individual(s) or entity(ies) granting rights
127 | under this Public License.
128 |
129 | k. Share means to provide material to the public by any means or
130 | process that requires permission under the Licensed Rights, such
131 | as reproduction, public display, public performance, distribution,
132 | dissemination, communication, or importation, and to make material
133 | available to the public including in ways that members of the
134 | public may access the material from a place and at a time
135 | individually chosen by them.
136 |
137 | l. Sui Generis Database Rights means rights other than copyright
138 | resulting from Directive 96/9/EC of the European Parliament and of
139 | the Council of 11 March 1996 on the legal protection of databases,
140 | as amended and/or succeeded, as well as other essentially
141 | equivalent rights anywhere in the world.
142 |
143 | m. You means the individual or entity exercising the Licensed Rights
144 | under this Public License. Your has a corresponding meaning.
145 |
146 |
147 | Section 2 -- Scope.
148 |
149 | a. License grant.
150 |
151 | 1. Subject to the terms and conditions of this Public License,
152 | the Licensor hereby grants You a worldwide, royalty-free,
153 | non-sublicensable, non-exclusive, irrevocable license to
154 | exercise the Licensed Rights in the Licensed Material to:
155 |
156 | a. reproduce and Share the Licensed Material, in whole or
157 | in part; and
158 |
159 | b. produce, reproduce, and Share Adapted Material.
160 |
161 | 2. Exceptions and Limitations. For the avoidance of doubt, where
162 | Exceptions and Limitations apply to Your use, this Public
163 | License does not apply, and You do not need to comply with
164 | its terms and conditions.
165 |
166 | 3. Term. The term of this Public License is specified in Section
167 | 6(a).
168 |
169 | 4. Media and formats; technical modifications allowed. The
170 | Licensor authorizes You to exercise the Licensed Rights in
171 | all media and formats whether now known or hereafter created,
172 | and to make technical modifications necessary to do so. The
173 | Licensor waives and/or agrees not to assert any right or
174 | authority to forbid You from making technical modifications
175 | necessary to exercise the Licensed Rights, including
176 | technical modifications necessary to circumvent Effective
177 | Technological Measures. For purposes of this Public License,
178 | simply making modifications authorized by this Section 2(a)
179 | (4) never produces Adapted Material.
180 |
181 | 5. Downstream recipients.
182 |
183 | a. Offer from the Licensor -- Licensed Material. Every
184 | recipient of the Licensed Material automatically
185 | receives an offer from the Licensor to exercise the
186 | Licensed Rights under the terms and conditions of this
187 | Public License.
188 |
189 | b. Additional offer from the Licensor -- Adapted Material.
190 | Every recipient of Adapted Material from You
191 | automatically receives an offer from the Licensor to
192 | exercise the Licensed Rights in the Adapted Material
193 | under the conditions of the Adapter's License You apply.
194 |
195 | c. No downstream restrictions. You may not offer or impose
196 | any additional or different terms or conditions on, or
197 | apply any Effective Technological Measures to, the
198 | Licensed Material if doing so restricts exercise of the
199 | Licensed Rights by any recipient of the Licensed
200 | Material.
201 |
202 | 6. No endorsement. Nothing in this Public License constitutes or
203 | may be construed as permission to assert or imply that You
204 | are, or that Your use of the Licensed Material is, connected
205 | with, or sponsored, endorsed, or granted official status by,
206 | the Licensor or others designated to receive attribution as
207 | provided in Section 3(a)(1)(A)(i).
208 |
209 | b. Other rights.
210 |
211 | 1. Moral rights, such as the right of integrity, are not
212 | licensed under this Public License, nor are publicity,
213 | privacy, and/or other similar personality rights; however, to
214 | the extent possible, the Licensor waives and/or agrees not to
215 | assert any such rights held by the Licensor to the limited
216 | extent necessary to allow You to exercise the Licensed
217 | Rights, but not otherwise.
218 |
219 | 2. Patent and trademark rights are not licensed under this
220 | Public License.
221 |
222 | 3. To the extent possible, the Licensor waives any right to
223 | collect royalties from You for the exercise of the Licensed
224 | Rights, whether directly or through a collecting society
225 | under any voluntary or waivable statutory or compulsory
226 | licensing scheme. In all other cases the Licensor expressly
227 | reserves any right to collect such royalties.
228 |
229 |
230 | Section 3 -- License Conditions.
231 |
232 | Your exercise of the Licensed Rights is expressly made subject to the
233 | following conditions.
234 |
235 | a. Attribution.
236 |
237 | 1. If You Share the Licensed Material (including in modified
238 | form), You must:
239 |
240 | a. retain the following if it is supplied by the Licensor
241 | with the Licensed Material:
242 |
243 | i. identification of the creator(s) of the Licensed
244 | Material and any others designated to receive
245 | attribution, in any reasonable manner requested by
246 | the Licensor (including by pseudonym if
247 | designated);
248 |
249 | ii. a copyright notice;
250 |
251 | iii. a notice that refers to this Public License;
252 |
253 | iv. a notice that refers to the disclaimer of
254 | warranties;
255 |
256 | v. a URI or hyperlink to the Licensed Material to the
257 | extent reasonably practicable;
258 |
259 | b. indicate if You modified the Licensed Material and
260 | retain an indication of any previous modifications; and
261 |
262 | c. indicate the Licensed Material is licensed under this
263 | Public License, and include the text of, or the URI or
264 | hyperlink to, this Public License.
265 |
266 | 2. You may satisfy the conditions in Section 3(a)(1) in any
267 | reasonable manner based on the medium, means, and context in
268 | which You Share the Licensed Material. For example, it may be
269 | reasonable to satisfy the conditions by providing a URI or
270 | hyperlink to a resource that includes the required
271 | information.
272 |
273 | 3. If requested by the Licensor, You must remove any of the
274 | information required by Section 3(a)(1)(A) to the extent
275 | reasonably practicable.
276 |
277 | b. ShareAlike.
278 |
279 | In addition to the conditions in Section 3(a), if You Share
280 | Adapted Material You produce, the following conditions also apply.
281 |
282 | 1. The Adapter's License You apply must be a Creative Commons
283 | license with the same License Elements, this version or
284 | later, or a BY-SA Compatible License.
285 |
286 | 2. You must include the text of, or the URI or hyperlink to, the
287 | Adapter's License You apply. You may satisfy this condition
288 | in any reasonable manner based on the medium, means, and
289 | context in which You Share Adapted Material.
290 |
291 | 3. You may not offer or impose any additional or different terms
292 | or conditions on, or apply any Effective Technological
293 | Measures to, Adapted Material that restrict exercise of the
294 | rights granted under the Adapter's License You apply.
295 |
296 |
297 | Section 4 -- Sui Generis Database Rights.
298 |
299 | Where the Licensed Rights include Sui Generis Database Rights that
300 | apply to Your use of the Licensed Material:
301 |
302 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right
303 | to extract, reuse, reproduce, and Share all or a substantial
304 | portion of the contents of the database;
305 |
306 | b. if You include all or a substantial portion of the database
307 | contents in a database in which You have Sui Generis Database
308 | Rights, then the database in which You have Sui Generis Database
309 | Rights (but not its individual contents) is Adapted Material,
310 |
311 | including for purposes of Section 3(b); and
312 | c. You must comply with the conditions in Section 3(a) if You Share
313 | all or a substantial portion of the contents of the database.
314 |
315 | For the avoidance of doubt, this Section 4 supplements and does not
316 | replace Your obligations under this Public License where the Licensed
317 | Rights include other Copyright and Similar Rights.
318 |
319 |
320 | Section 5 -- Disclaimer of Warranties and Limitation of Liability.
321 |
322 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
323 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
324 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
325 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
326 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
327 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
328 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
329 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
330 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
331 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
332 |
333 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
334 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
335 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
336 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
337 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
338 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
339 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
340 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
341 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
342 |
343 | c. The disclaimer of warranties and limitation of liability provided
344 | above shall be interpreted in a manner that, to the extent
345 | possible, most closely approximates an absolute disclaimer and
346 | waiver of all liability.
347 |
348 |
349 | Section 6 -- Term and Termination.
350 |
351 | a. This Public License applies for the term of the Copyright and
352 | Similar Rights licensed here. However, if You fail to comply with
353 | this Public License, then Your rights under this Public License
354 | terminate automatically.
355 |
356 | b. Where Your right to use the Licensed Material has terminated under
357 | Section 6(a), it reinstates:
358 |
359 | 1. automatically as of the date the violation is cured, provided
360 | it is cured within 30 days of Your discovery of the
361 | violation; or
362 |
363 | 2. upon express reinstatement by the Licensor.
364 |
365 | For the avoidance of doubt, this Section 6(b) does not affect any
366 | right the Licensor may have to seek remedies for Your violations
367 | of this Public License.
368 |
369 | c. For the avoidance of doubt, the Licensor may also offer the
370 | Licensed Material under separate terms or conditions or stop
371 | distributing the Licensed Material at any time; however, doing so
372 | will not terminate this Public License.
373 |
374 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
375 | License.
376 |
377 |
378 | Section 7 -- Other Terms and Conditions.
379 |
380 | a. The Licensor shall not be bound by any additional or different
381 | terms or conditions communicated by You unless expressly agreed.
382 |
383 | b. Any arrangements, understandings, or agreements regarding the
384 | Licensed Material not stated herein are separate from and
385 | independent of the terms and conditions of this Public License.
386 |
387 |
388 | Section 8 -- Interpretation.
389 |
390 | a. For the avoidance of doubt, this Public License does not, and
391 | shall not be interpreted to, reduce, limit, restrict, or impose
392 | conditions on any use of the Licensed Material that could lawfully
393 | be made without permission under this Public License.
394 |
395 | b. To the extent possible, if any provision of this Public License is
396 | deemed unenforceable, it shall be automatically reformed to the
397 | minimum extent necessary to make it enforceable. If the provision
398 | cannot be reformed, it shall be severed from this Public License
399 | without affecting the enforceability of the remaining terms and
400 | conditions.
401 |
402 | c. No term or condition of this Public License will be waived and no
403 | failure to comply consented to unless expressly agreed to by the
404 | Licensor.
405 |
406 | d. Nothing in this Public License constitutes or may be interpreted
407 | as a limitation upon, or waiver of, any privileges and immunities
408 | that apply to the Licensor or You, including from the legal
409 | processes of any jurisdiction or authority.
410 |
411 |
412 | =======================================================================
413 |
414 | Creative Commons is not a party to its public licenses.
415 | Notwithstanding, Creative Commons may elect to apply one of its public
416 | licenses to material it publishes and in those instances will be
417 | considered the "Licensor." Except for the limited purpose of indicating
418 | that material is shared under a Creative Commons public license or as
419 | otherwise permitted by the Creative Commons policies published at
420 | creativecommons.org/policies, Creative Commons does not authorize the
421 | use of the trademark "Creative Commons" or any other trademark or logo
422 | of Creative Commons without its prior written consent including,
423 | without limitation, in connection with any unauthorized modifications
424 | to any of its public licenses or any other arrangements,
425 | understandings, or agreements concerning use of licensed material. For
426 | the avoidance of doubt, this paragraph does not form part of the public
427 | licenses.
428 |
429 | Creative Commons may be contacted at creativecommons.org.
430 |
431 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## NOTE: You are viewing the archived version of this RNA-seq analysis tutorial. It is maintained for consistency with the published materials (Griffith et al. 2015. PLoS Comp Biol.) and for past students who wish to review the material covered. We strongly suggest that you use the current version of this tutorial at www.rnaseq.wiki.
2 |
3 | ### Informatics for RNA-seq: A web resource for analysis on the cloud
4 | ===============
5 | An educational tutorial and working demonstration pipeline for RNA-seq analysis, including an introduction to cloud computing, next-generation sequence file formats, reference genomes, gene annotation, expression analysis, differential expression analysis, alternative splicing analysis, data visualization, and interpretation.
6 |
7 | This repository stores code and selected raw materials for a detailed RNA-seq tutorial. To work through the tutorial itself, go to the RNA-seq tutorial wiki.
8 |
9 | Citation:
10 | Malachi Griffith\*, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, Obi L. Griffith\*. 2015. Informatics for RNA-seq: A web resource for analysis on the cloud. PLoS Comp Biol. 11(8):e1004393.
11 |
12 | \*To whom correspondence should be addressed: E-mail: mgriffit[AT]genome.wustl.edu, ogriffit[AT]genome.wustl.edu
13 |
14 | ===============
15 | ### Tutorial Table of Contents
16 |
17 | - Module 0 - Introduction and Cloud Computing
18 |
19 | - Authors
20 | - Citation and Supplementary Materials
21 | - Syntax
22 | - Intro to AWS Cloud Computing
23 | - Logging into Amazon Cloud
24 | - Unix Bootcamp
25 | - Environment
26 | - Resources
27 |
28 | - Module 1 - Introduction to RNA sequencing
29 |
30 | - Installation
31 | - Reference Genome
32 | - Annotation
33 | - Indexing
34 | - RNA-seq Data
35 | - PreAlignment QC
36 |
37 | - Module 2 - RNA-seq Alignment and Visualization
38 |
39 | - Adapter Trim
40 | - Alignment
41 | - IGV
42 | - PostAlignment Visualization
43 | - PostAlignment QC
44 |
45 | - Module 3 - Expression and Differential Expression
46 |
47 | - Expression
48 | - Differential Expression
49 | - DE Visualization
50 |
51 | - Module 4 - Isoform Discovery and Alternative Expression
52 |
53 | - Reference Guided Transcript Assembly
54 | - de novo Transcript Assembly
55 | - Transcript Assembly Merge
56 | - Differential Splicing
57 | - Transcript Assembly Visualization
58 |
59 | - Module 5 - Reference free analysis
60 |
61 | - Use of Kallisto for Abundance Estimation
62 |
63 | - Appendix
64 |
65 | - Abbreviations
66 | - Lectures
67 | - Practical Exercise Solutions
68 | - Integrated Assignment
69 | - Proposed Improvements
70 | - AWS Setup
71 |
72 |
73 |
74 |
--------------------------------------------------------------------------------
/bin/Tutorial_Module4_ERCC_expression.pl:
--------------------------------------------------------------------------------
#!/usr/bin/perl

# Extract ERCC spike-in read counts from the gene-level count table and write
# them in long format: one row per ERCC ID per sample, annotated with the
# concentration of that ERCC ID in the spike-in mix assigned to the sample.

use strict;
use warnings;

use IO::File;

# All input and output paths are built from the RNA_HOME environment variable.
my $rna_home = $ENV{RNA_HOME};
unless (defined $rna_home) { die('Environment variable RNA_HOME is not set'); }

my $data_dir = $rna_home .'/refs/ERCC';
my $ercc_file = $data_dir .'/ERCC_Controls_Analysis.txt';
my $counts_file = $rna_home .'/expression/tophat_counts/gene_read_counts_table_all.tsv';
my $ercc_counts_file = $rna_home .'/expression/tophat_counts/ercc_read_counts.tsv';

my $ercc_fh = IO::File->new($ercc_file,'r');
unless ($ercc_fh) { die('Failed to open file: '. $ercc_file .': '. $!); }

# Index the ERCC annotation by ERCC ID (second column).
# Columns: resort, id, subgroup, mix1, mix2, fold_change, log2
my %ercc_data;
while (my $ercc_line = $ercc_fh->getline) {
  chomp($ercc_line);
  if ($ercc_line =~ /^Re/) { next; } # skip the header line, which begins with "Re"
  my @ercc_entry = split("\t",$ercc_line);
  unless (defined $ercc_entry[1]) { next; } # skip blank or malformed lines
  $ercc_data{$ercc_entry[1]} = \@ercc_entry;
}
$ercc_fh->close;

# Sample labels, in the same order as the count columns that follow the gene ID.
my @labels = qw/UHR_Rep1 UHR_Rep2 UHR_Rep3 HBR_Rep1 HBR_Rep2 HBR_Rep3/;

my $counts_fh = IO::File->new($counts_file,'r');
unless ($counts_fh) { die('Failed to open file: '. $counts_file .': '. $!); }

my $ercc_counts_fh = IO::File->new($ercc_counts_file,'w');
unless ($ercc_counts_fh) { die('Failed to open file: '. $ercc_counts_file .': '. $!); }

print $ercc_counts_fh "ID\tSubgroup\tLabel\tMix\tConcentration\tCount\n";
while (my $counts_line = $counts_fh->getline) {
  chomp($counts_line);
  my @count_entry = split(' ',$counts_line);
  unless (defined $count_entry[0]) { next; } # skip blank lines

  # Only rows whose ID appears in the ERCC annotation are written out.
  if ($ercc_data{$count_entry[0]}) {
    my $id = $count_entry[0];
    my $subgroup = $ercc_data{$id}->[2];
    for (my $i = 0; $i < scalar(@labels); $i++) {
      my $count = $count_entry[$i+1];
      my $label = $labels[$i];
      my $conc;
      my $mix;
      # UHR samples use the Mix 1 concentration; all others (HBR) use Mix 2.
      if ($label =~ /UHR/) {
        $mix = 1;
        $conc = $ercc_data{$id}->[3];
      } else {
        $mix = 2;
        $conc = $ercc_data{$id}->[4];
      }
      print $ercc_counts_fh $id ."\t". $subgroup ."\t". $label ."\t". $mix ."\t". $conc ."\t". $count ."\n";
    }
  }
}

$counts_fh->close;
$ercc_counts_fh->close or die('Failed to write file: '. $ercc_counts_file .': '. $!);

exit;
--------------------------------------------------------------------------------
/manuscript/figures/Fig1.eps:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Fig1.eps
--------------------------------------------------------------------------------
/manuscript/figures/Fig2.eps:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Fig2.eps
--------------------------------------------------------------------------------
/manuscript/figures/Fig3.eps:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Fig3.eps
--------------------------------------------------------------------------------
/manuscript/figures/Fig4.eps:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Fig4.eps
--------------------------------------------------------------------------------
/manuscript/figures/Fig5.eps:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Fig5.eps
--------------------------------------------------------------------------------
/manuscript/figures/Fig6.eps:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Fig6.eps
--------------------------------------------------------------------------------
/manuscript/figures/Figure1.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure1.ai
--------------------------------------------------------------------------------
/manuscript/figures/Figure1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure1.pdf
--------------------------------------------------------------------------------
/manuscript/figures/Figure1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure1.png
--------------------------------------------------------------------------------
/manuscript/figures/Figure2.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure2.ai
--------------------------------------------------------------------------------
/manuscript/figures/Figure2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure2.pdf
--------------------------------------------------------------------------------
/manuscript/figures/Figure2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure2.png
--------------------------------------------------------------------------------
/manuscript/figures/Figure3.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure3.ai
--------------------------------------------------------------------------------
/manuscript/figures/Figure3.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure3.pdf
--------------------------------------------------------------------------------
/manuscript/figures/Figure3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure3.png
--------------------------------------------------------------------------------
/manuscript/figures/Figure4.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure4.ai
--------------------------------------------------------------------------------
/manuscript/figures/Figure4.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure4.pdf
--------------------------------------------------------------------------------
/manuscript/figures/Figure4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure4.png
--------------------------------------------------------------------------------
/manuscript/figures/Figure5.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure5.ai
--------------------------------------------------------------------------------
/manuscript/figures/Figure5.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure5.pdf
--------------------------------------------------------------------------------
/manuscript/figures/Figure5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure5.png
--------------------------------------------------------------------------------
/manuscript/figures/Figure6.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure6.ai
--------------------------------------------------------------------------------
/manuscript/figures/Figure6.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure6.pdf
--------------------------------------------------------------------------------
/manuscript/figures/Figure6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/Figure6.png
--------------------------------------------------------------------------------
/manuscript/figures/README.md:
--------------------------------------------------------------------------------
1 | Original figure files for all figures in the manuscript: Informatics for RNA-seq: A web resource for analysis on the cloud.
2 |
3 | These files are made available under a creative commons attribution-share alike license (CC BY-SA 4.0):
4 | https://creativecommons.org/licenses/by-sa/4.0/
5 |
6 |
--------------------------------------------------------------------------------
/manuscript/figures/StrikingImage.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/StrikingImage.ai
--------------------------------------------------------------------------------
/manuscript/figures/StrikingImage.eps:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/StrikingImage.eps
--------------------------------------------------------------------------------
/manuscript/figures/StrikingImage.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/StrikingImage.pdf
--------------------------------------------------------------------------------
/manuscript/figures/StrikingImage.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/StrikingImage.png
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Fig6-pcb-edits.eps:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Fig6-pcb-edits.eps
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Figure2/DSC_0267_cropped.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Figure2/DSC_0267_cropped.jpg
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Figure2/Figure_2.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Figure2/Figure_2.ai
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Figure2/Tumor-normal.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Figure2/Tumor-normal.ai
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Figure2/cDNA.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Figure2/cDNA.ai
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Figure2/flow-cell.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Figure2/flow-cell.ai
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Figure2/flowcell-600dpi.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Figure2/flowcell-600dpi.jpg
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Figure2/mapping.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Figure2/mapping.ai
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Figure2/rna.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Figure2/rna.ai
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Figure2/tissue.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Figure2/tissue.ai
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Figure2/total_rna.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Figure2/total_rna.ai
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Figure2/unmapped_reads.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Figure2/unmapped_reads.ai
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Figure5.ppt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Figure5.ppt
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Figure5_new_mockup.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Figure5_new_mockup.jpg
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Figure6C_vector.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Figure6C_vector.ai
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Figure6_vector.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Figure6_vector.ai
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/RNA-Seq-alignment.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/RNA-Seq-alignment.png
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/Stranded_vs_Unstranded_SoftwareSettings.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/Stranded_vs_Unstranded_SoftwareSettings.png
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/igv_snapshot_strand.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/igv_snapshot_strand.png
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/igv_snapshot_strand2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/igv_snapshot_strand2.png
--------------------------------------------------------------------------------
/manuscript/figures/raw_materials/igv_snapshot_strand3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/manuscript/figures/raw_materials/igv_snapshot_strand3.png
--------------------------------------------------------------------------------
/manuscript/supplementary_tables/addUrls.pl:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env perl
2 |
3 | use warnings;
4 | use strict;
5 |
6 | while(<>){
7 |
8 | my @elements = split(",", $_);
9 | my $c = 0;
10 | foreach my $e (@elements){
11 | $c++;
12 | if ($e =~ /href/i){
13 | print $_;
14 | next;
15 | }
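16 | $e =~ s/(\d{7,8})/\<a href=\"https:\/\/www.ncbi.nlm.nih.gov\/pubmed\/$1\"\>$1\<\/a\>/g; # wrap each 7-8 digit ID in a link; target URL is an assumption (IDs treated as PubMed IDs)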
17 | $e .= "," unless ($c == scalar(@elements));
18 | print "$e";
19 | }
20 |
21 | }
22 |
23 | exit;
24 |
--------------------------------------------------------------------------------
/manuscript/supplementary_tables/supplementary_table_1.md:
--------------------------------------------------------------------------------
1 | ###Supplementary Table 1. RNA-seq analysis techniques
2 | There are several downstream analysis goals for which RNA-seq is well suited. The main categories are described briefly below with reference to supporting materials. Refer to Supplementary Table 2 for specific tools relevant to many of these areas. For each application, a basic data recommendation is provided. It is important to remember that these are simply examples. In addition to the varying demands of each analysis technique, data requirements will depend heavily on the size and complexity of the genome, the complexity of the transcriptome, the method of RNA isolation and library preparation, the need to robustly detect transcripts with low copy numbers, and many other factors. For the purposes of this table, low RNA-seq depth is 5-25M reads, moderate depth is 25-100M reads, and high depth is 100-500M reads. Similarly, short reads are 50-200 bp and long reads are 200-500 bp.
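
As a rough illustration of the depth and read length categories defined above, the following Python sketch maps a library to those labels. The function names are hypothetical, and the handling of the shared boundaries (25M, 100M, and 200 bp are assigned to the higher category) is an assumption, since the ranges as stated overlap at their endpoints.

```python
def classify_depth(reads_millions: float) -> str:
    """Label a library size (in millions of reads) using the table's ranges.

    Assumption: boundary values (25M, 100M) fall into the higher category.
    """
    if 5 <= reads_millions < 25:
        return "low"
    if 25 <= reads_millions < 100:
        return "moderate"
    if 100 <= reads_millions <= 500:
        return "high"
    return "outside defined ranges"


def classify_read_length(length_bp: int) -> str:
    """Label a read length (in bp) using the table's ranges.

    Assumption: a 200 bp read is treated as long.
    """
    if 50 <= length_bp < 200:
        return "short"
    if 200 <= length_bp <= 500:
        return "long"
    return "outside defined ranges"


if __name__ == "__main__":
    # Example: a 40M-read library of 100 bp reads
    print(classify_depth(40), classify_read_length(100))
```

Values below 5M reads or above 500M reads, and read lengths outside 50-500 bp, are reported as outside the defined ranges rather than forced into a category.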
3 |
4 | | RNA-seq analysis technique | Description |
5 | |--------------------------|:------------|
6 | | **Gene annotation and transcript discovery** [19592507, 20436464, 20935650, 21572440, 25608678, 19087247, 21623353] | RNA-seq produces short reads (often paired) from short (~100-500 bp) fragments of cDNA by shotgun sequencing. When performed to high depth for a single sample it is possible to make strong inferences regarding the regions of transcriptional activity within a genome and the exon-intron structure of transcripts expressed from each region. When a reference genome sequence is available, RNA-seq reads can be aligned to this sequence using a splice aware aligner and the intron-exon boundaries and exon-exon connections of expressed transcripts can be determined. More generally, the gene loci where expression occurs can be enumerated. The presence of multiple transcript isoforms expressed from a single locus can also be determined in many cases, though the full length structure may be difficult to completely resolve. In addition to aligning reads to a reference genome sequence, transcripts can also be inferred by de novo transcript assembly. If a reference genome sequence is available, the resulting assembled contigs can be aligned and used to determine exon-intron structure. If no reference genome sequence is available, the resulting contigs can still be used for gene annotation by analysis with ORF finding tools, sequence conservation comparisons to related species, etc. Data recommendation: This application will benefit from longer reads, especially in species with large introns, small exons, and complex splicing patterns. Sequence depth will influence the comprehensiveness of transcripts that can be annotated, with lowly expressed transcripts requiring perhaps 100M reads or more to effectively cover. |
7 | | **Gene and transcript expression estimation** [24109770, 24185837, 24685233, 24885830] | In the ‘gene annotation and transcript discovery’ discussion above we described the use of RNA-seq to determine which gene loci are transcribed by RNA polymerase and how the exon-intron structure of those transcripts is determined by the splicing machinery. Gene and transcript expression estimation by RNA-seq involves the abundance estimation of all transcripts or individual transcript isoforms expressed from each locus. This step often relies on existing transcript annotations for a species (e.g., a transcriptome GTF file from Ensembl) or it requires that you first predict the transcripts present in your data and then derive abundance estimates for those transcripts. Cufflinks and HTSeq are two examples of tools used to estimate transcript and gene abundances. To estimate abundance by RNA-seq, reads are aligned either to a reference genome, reference transcript sequences, assembled transcript contigs derived from the same data, or some combination of these. The read count observed at each locus or for each known transcript sequence is then used to estimate the relative abundance of each transcript. If a spike-in reagent with known concentrations was used during library construction, it may be possible to estimate absolute copy number values in the sample. Abundance estimation tools may attempt to normalize expression values to account for biases related to the different sizes of transcripts, varying GC content, varying library sequence depth, and so on. Further normalization across a series of samples may involve examination of a set of ‘housekeeping genes’ and/or use of other data normalization techniques [22988256]. Data recommendation: This application places one of the lowest demands on library depth compared to other RNA-seq analysis techniques. 
For gene-level expression estimation, as few as 5-10M reads may be sufficient for mRNA-seq libraries and where possible additional replicates may be preferable to deeper individual libraries. This application will work well with long or short reads that may or may not be paired end. |
8 | | **Differential gene or transcript expression analysis** [22383036, 21903743, 21176179] | Differential gene or transcript expression involves comparison of abundance estimates between two or more conditions. For example, the abundance of a gene may be compared across different tissues, developmental stages, chemical exposures, disease versus healthy states, etc. There are many biases that influence abundance estimates for each gene or transcript. One advantage of differential expression analysis is that many of these biases will be consistent across the samples and ‘cancel out’, leaving potentially biologically relevant differences in gene expression. Unfortunately, many factors may introduce systematic bias that is not equal across the RNA-seq data sets being studied. There are many approaches for identifying batch effects and for performing data normalization that may mitigate the effect of these systematic biases. Data recommendation: This application has close to the same data requirements as expression estimation except that some additional sequence depth may be required to accurately estimate subtle differences in expression between samples. |
9 | | **Alternative expression (alternative transcript initiation, polyadenylation, and splicing) analysis** [20835245] | Alternative expression is closely related to differential expression but attempts to identify differences in the relative ratios of alternative isoforms expressed from a locus. It is possible for the overall expression output from a locus to remain unchanged between two conditions but for the relative expression levels of alternative isoforms to shift significantly. Alternative expression can be caused by changes in the use of alternative transcript initiation sites, exon-intron splice sites, and polyadenylation sites at a locus. Many human protein-coding loci have extensive potential for alternative expression and the majority of human loci have at least one known alternative isoform. Subtle changes in the structure of transcripts may have pronounced functional consequences but have relatively subtle effects on transcript or gene abundance estimates. Alternative expression analysis by RNA-seq has the potential for a more nuanced representation of the transcriptional state of a sample compared to simple gene expression analysis, though it comes at the cost of more complicated algorithms. Data recommendation: This application will place some of the highest demands on library depth and read length. To robustly assay the alternative expression patterns of human tissues we recommend at least 300-500M reads. This application will also benefit from longer reads. |
10 | | **Allele specific expression (ASE) analysis** [20567245, 21811232] | In diploid (or polyploid) species RNA expression can occur independently from each inherited chromosome. Maternal and paternal derived alleles of each gene locus may contain sequence differences such as common polymorphisms (e.g., SNPs) and mutations. They also differ in their methylation (e.g., imprinting) or other epigenetic states. While many gene loci exhibit balanced expression from each allele, some loci exhibit unbalanced or allele specific expression patterns [20567245]. This allele specific bias could be caused for example by a polymorphism near a promoter that increases transcription factor recruitment and increased polymerase activity for one allele compared to another. The same kinds of allele specific effects can influence choice of alternative transcript initiation sites, alternative splice sites, and alternative polyadenylation sites. Mutations can result in completely novel expressed isoforms being generated from the mutated allele. Allele specific expression analysis uses the presence of known heterozygous polymorphisms within the expressed portion of genes to observe the balance/imbalance of expression from both alleles. In order to perform allele specific expression analysis it is desirable to accurately identify these heterozygous sites in the individual being studied. One therefore typically needs both DNA sequence (DNA-seq) (e.g., WGS or Exome) data as well as RNA-seq data for each sample to be analyzed for allele specific expression. Data recommendation: This application has moderate demands for library depth compared to other RNA-seq analyses. Getting accurate variant allele frequencies (VAFs) will require low library sequence depth for highly expressed genes but high library sequence depth for lowly expressed genes. Measuring allele specific expression for single nucleotide variants will work well with 100 bp reads (or perhaps shorter). 
Measuring allele specific expression for insertions and deletions (especially >10-20 bp) will benefit from longer read length libraries. |
11 | | **RNA editing analysis** [21960545, 22327324, 22955975] | RNA editing describes nucleotide sequence modifications to RNA molecules that happen after transcription by an RNA polymerase. Such modifications result in apparent changes in the RNA sequence from that which would be predicted from the genome sequence. It is possible to detect such sequence changes in RNA-seq data, but in order to be convinced that the change is due to RNA editing, DNA-seq data is required for the same sample. In simple terms, by comparing the transcribed sequence by RNA-seq to the genome sequence by DNA-seq (WGS or Exome) one can infer that RNA editing has taken place at the RNA level. However, due to sequence errors, mapping artifacts and other sources of systematic biases, care must be taken to distinguish false positives from true RNA editing events and the prevalence of RNA editing as determined by RNA-seq analysis remains a controversial area of research. Data recommendation: This application has moderate demands for library depth, similar to those for allele specific expression. However, since RNA-edits consist primarily of single nucleotide changes, longer read lengths are a lower priority for this application compared to allele specific expression analysis.|
12 | | **Variant detection (variant discovery)** [23555596, 24075185, 22468815] | While variant detection typically involves analysis of DNA-seq data such as WGS or exome data [23341494], it is also possible to perform detection of single nucleotide variants and small insertions or deletions using RNA-seq data [23555596, 24075185, 22468815]. RNA-seq variant detection involves alignment of RNA-seq reads to a reference genome sequence or database of reference transcript sequences followed by scanning the resulting alignments for sites that exhibit sequence base differences relative to the reference sequences. The proportion of reads harboring the variant sequence is used to calculate a variant allele frequency (VAF) from 0 to 100%. The number of variant supporting reads, VAF, base qualities at the variant position, read alignment qualities, overall level of coverage, and other factors collectively influence the confidence of each variant prediction (i.e. the probability that it is a real variant and not a false positive). There are several factors that complicate this variant detection when performed with RNA-seq instead of DNA-seq data. In eukaryotic species, the presence of introns complicates alignment of reads to a reference genome and may lead to alignment errors. In some cases, these errors may result in false positive variant calls where repeated alignment errors result in systematic mismatches. These false positives are enriched near the edges of exons when performing variant discovery with RNA-seq data because correctly resolving exon-intron-exon alignments is difficult, especially with large introns and where only a short portion of a read spans from one exon to the next. Reads that mostly align to one exon but spill over the edge of it can result in misaligned bases. False negatives may also occur in regions of the genome that are difficult to map to, and these alignment holes may be more prevalent where exon-intron structures complicate alignment. 
Some of these alignment issues may be overcome by aligning reads directly to predicted transcript sequences and performing variant detection by observing sequence differences between the known transcript sequence and aligned reads. Library end bias and corresponding lack of coverage near the 5’ end of transcripts may also result in false negatives due to poor coverage. Furthermore, detection of polymorphisms will be limited to genes that are expressed in the tissue being profiled and the ability to call variants within expressed genes will vary across the range of expression levels. In highly expressed genes, coverage may be extremely high. This can lead to detection of false positives at low variant allele frequency if the variant detector does not use appropriate statistics. In genes that are not expressed, variant detection will not be possible. Some mutations within exons may lead to nonsense mediated decay (NMD) that results in decreased stability of the mutant harboring transcripts. This will reduce the ability to detect such loss of function events when using RNA-seq data alone. Data recommendation: This application has moderate to high demands on library depth and read length depending on the specific type of variant detection as outlined in the following four entries of this table. |
13 | | **Common polymorphism detection** [23555596, 24075185, 22468815] | It is possible by RNA-seq analysis to detect common polymorphisms (e.g., SNPs) that occur within expressed exons [23555596, 24075185]. The sites of many of these are known in many species and this knowledge can be used to guide their detection. Since the expected frequency of heterozygous and homozygous polymorphisms is high (~50% and ~100% respectively) they can be readily detected even in genes with low expression levels and therefore low read coverage. As discussed above, allele specific expression may reduce or increase the expected frequency of heterozygous SNPs. Since the majority of common polymorphisms occur within introns or outside of gene loci, a relatively narrow subset of polymorphisms will be assayed by RNA-seq data alone. Data recommendation: This application has moderate demands on library depth. Since the variants are expected to occur at 50 or 100% VAF, detecting them should be possible with 20-30x coverage at each site. Target library depth will be driven by the amount of data needed to achieve this coverage for lowly expressed genes. As described for allele specific expression above, variants with substantial nucleotide differences from the reference genome sequence (e.g., insertions and deletions) may benefit from longer read lengths to facilitate accurate alignment of reads containing the variant sequence. |
14 | | **Germline mutation detection** [23555596, 24075185, 22468815] | RNA-seq analysis for germline mutation detection is largely equivalent to the detection of polymorphisms as described above except the variants being discovered are very rare in the population (they may even be private to a single individual). Without prior knowledge of the expected site of mutation the analysis must scan the entire transcriptome. Such analysis may be greatly aided by having RNA-seq data from related family members (e.g., a trio of mother, father, child). As with polymorphism detection, mutation detection in RNA-seq data will be complicated by the varying expression levels of each gene and allele specific expression. Data recommendation: This application has essentially the same data needs as common polymorphism detection described above. |
15 | | **Somatic mutation detection** [23555596, 24075185, 22468815] | Somatic mutation detection has many similarities to other types of variant detection described above. It still involves detection of variants but adds an extra consideration to identify the subset of variants that were likely acquired in the DNA of the tumor (i.e. those that are not germline inherited variants). Somatic mutation detection is possible but difficult with RNA-seq data compared to DNA-seq data such as WGS or exome data. Using DNA-seq data it is common to compare tumor sequence data directly to matched normal data to assess the somatic status of variants. The normal DNA sample is usually blood in the case of solid tumors, and usually a skin biopsy in the case of hematologic tumors. Since we expect approximately even coverage across the genome (or exome) for both the tumor and normal sample, we can compare DNA-seq reads at each position harboring a variant in the tumor data and assess its presence in the normal data. Convincing somatic variant sites will have good sequence coverage in both the normal and tumor sample but will only have significant support for the variant base in the tumor data. This kind of sample pairing for tumor/normal comparison is not usually appropriate for RNA-seq data. Using a blood normal RNA sample as a comparator for a solid tumor would not work well because the gene expression pattern for the solid tumor would not be expected to match that of the blood sample. In other words we often may not have coverage of variant sites in both normal and tumor. Furthermore, there may be differences in allele specific expression between the tumor and normal comparator. For some tumor types, it may be possible to obtain a tissue-matched normal sample to use as a comparator for determining somatic status. For example, a breast tumor sample could be compared to adjacent normal breast tissue obtained from the same individual. 
However, even in such cases, the matched normal may not have the same composition of cell types that the tumor has and there may be significant differences in the transcriptome landscape between tumor and normal samples that confound somatic variant determination. One strategy that could be used to circumvent this challenge is to compare the tumor RNA-seq data to normal DNA-seq data such as exome data. Given the decreasing cost of WGS and exome data it is probably more appropriate to simply produce RNA-seq data for the tumor and DNA-seq data for both the tumor and a matched normal. Data recommendation: This application is similar to other variant detection types but has substantially increased demands on library sequence depth compared to other categories because it involves the detection of somatic variants in tumor samples that may be contaminated with normal DNA (thereby reducing the observable VAF and number of variant supporting reads) or confounded by tumor heterogeneity (where some mutations exist only in subclonal populations). If a normal RNA sample is used as a comparator to determine the somatic status (often not possible), good coverage of that sample will also be required. As with other variant detection types, characterization of complex variants may benefit from longer reads. |
16 | | **Mutation expression assessment** [24752137] | Perhaps the most common application of RNA-seq data in the sphere of mutation detection is to first detect all mutations (germline or somatic) using DNA-seq data and then only use the RNA-seq data to assess the expression status of each mutation [24752137]. This is equivalent to the allele specific expression analysis described above except that instead of relying on known sites of polymorphism common in the population it relies on a prior mutation detection step using DNA-seq data. The ability to assess the expression status of mutations varies by the complexity of the mutation. Single nucleotide variants (SNVs) will be relatively straightforward but larger insertions and deletions may be more challenging because they are harder to align. In general, one should be careful to remember that the failure to confirm expression of a mutation observed at the genome level in the transcriptome could be influenced by differences in alignment between the DNA-seq and RNA-seq at the site of the mutation as well as other RNA-seq specific biases such as reduced RNA-seq coverage at the 5’ ends of transcripts. Data recommendation: This application has perhaps the lowest demands on library depth compared to other variant detection applications since the variants are already detected at the DNA-level and RNA-seq data is only used to assess their expression level. However, these variants may occur anywhere in a transcript and might occur in transcripts with low but functionally significant expression levels. In other words, comprehensive and deep coverage of the transcriptome is still desirable for this application. As with other variant detection types, characterization of complex variants may benefit from longer reads. |
17 | | **Gene fusion detection** [23555082, 23815381, 25266161, 25286921, 25500544] | Gene fusion detection by RNA-seq is mostly performed in the context of tumor sequencing projects [23555082, 22877769, 24320890]. A gene fusion is a chimeric transcript that combines portions of two transcripts normally expressed from two distinct gene loci, ‘gene A’ and ‘gene B’ (e.g., BCR-ABL, EML4-ALK, etc.). Gene fusions may arise as a consequence of structural variations in the genome such as deletions, insertions, inversions, and translocations. Identification of fusion events in RNA-seq data relies on two main forms of alignment information. (1) Paired-end read information where one read of a pair maps to ‘gene A’ and the other read of that pair maps to ‘gene B’. Such reads are sometimes referred to as encompassing reads. (2) Individual reads that align across the junction of ‘gene A’ and ‘gene B’. For example, a read where the first half maps to the edge of an exon in ‘gene A’ and the second half of this read maps to the edge of an exon in ‘gene B’. Such reads are sometimes referred to as spanning reads. Drops or spikes in read coverage levels across the length of either ‘gene A’ or ‘gene B’ that correspond to the apparent breakpoint may also help to support the existence of a gene fusion. In some cases, reads that align only partially, with the unaligned portion ‘soft clipped’ or appearing as a stretch of mismatches, may suggest the presence of a fusion breakpoint. Fusion detection tools currently under development attempt to combine evidence from both the RNA and DNA level to give more accurate predictions. 
Gene fusion detection tools generally have complex processes for producing alignments suitable for fusion detection, filtering steps to remove false positives that occur widely in genes with paralogs, an assembly step that attempts to determine the fusion sequence, annotation steps that attempt to determine if an in frame fusion product is likely to result from an RNA fusion transcript, and additional annotation steps. Data recommendation: This application has moderate demands on library depth assuming that an oncogenic fusion gene is likely to be expressed at reasonably high levels. Reads should be paired-end as most fusion detection tools assume pairing information will be present. Medium to long reads are desirable to ensure accurate alignment of reads to genes with many paralogs or pseudogenes and also to allow accurate mapping of reads that span across fusion breakpoints that may involve any two points in the genome.|
18 | | **Viral detection** [22647373, 23279287, 23740984] | Expression of some viruses may be detected and their genome characterized by RNA-seq [22647373, 19394993]. In some human tumors, certain viruses may be present either as endogenous elements within the cell or integrated into the genome [23740984, 24085110]. In either case it may be possible to detect expression of viral transcripts in RNA-seq libraries generated from these cells. Detection of viral sequences may involve inclusion of certain viral reference sequences (e.g., HPV, HBV, HCV, EBV, etc.) in the reference genome sequence database to which all RNA-seq reads are aligned. Another strategy is to obtain only those reads that do not align to the reference genome sequence for the species being studied and attempt to align these reads to a database of viral sequences. Some strategies further involve de novo assembly of these reads into contigs prior to alignment to viral sequence databases. Data recommendation: This application requires moderate to high library depth. Detecting sequences that align to viral genomes or distinctive viral k-mers may not require a very deep library if the virus is actively expressing RNAs in the tissue sampled. Identifying fusion sequences involving viruses has many of the same complexities as normal fusion detection and likewise may benefit from longer read lengths. |
19 |
20 |
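The two forms of fusion evidence described in the gene fusion row above (encompassing read pairs and spanning reads) can be sketched in Python as follows. This is a simplified illustration, not the method of any fusion caller: the `ReadPair` structure, the `fusion_evidence` function, and the assumption that each alignment segment has already been assigned to a gene are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ReadPair:
    """Hypothetical summary of one paired-end fragment.

    Each tuple lists the gene hit by every alignment segment of that mate,
    in read order. An unsplit read has one entry; a split read has several.
    """
    mate1_genes: Tuple[str, ...]
    mate2_genes: Tuple[str, ...]


def fusion_evidence(pair: ReadPair, gene_a: str, gene_b: str) -> Optional[str]:
    """Label a read pair as 'spanning', 'encompassing', or None."""
    target = {gene_a, gene_b}
    # Spanning: a single read whose segments align to both genes,
    # i.e. the read itself crosses the fusion junction.
    for genes in (pair.mate1_genes, pair.mate2_genes):
        if target <= set(genes):
            return "spanning"
    # Encompassing: each mate aligns to a single gene and the two
    # mates together cover gene A and gene B.
    s1, s2 = set(pair.mate1_genes), set(pair.mate2_genes)
    if len(s1) == 1 and len(s2) == 1 and (s1 | s2) == target:
        return "encompassing"
    return None


encompassing = fusion_evidence(ReadPair(("BCR",), ("ABL1",)), "BCR", "ABL1")
spanning = fusion_evidence(ReadPair(("BCR", "ABL1"), ("ABL1",)), "BCR", "ABL1")
unrelated = fusion_evidence(ReadPair(("BCR",), ("BCR",)), "BCR", "ABL1")
```

Real tools add the filtering, assembly, and annotation steps described in the row; this sketch only shows how the two read categories differ.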
--------------------------------------------------------------------------------
/manuscript/supplementary_tables/supplementary_table_1_urls.md:
--------------------------------------------------------------------------------
1 | ###Supplementary Table 1. RNA-seq analysis techniques
2 | There are several downstream analysis goals for which RNA-seq is well suited. Main categories of these are described briefly below with reference to supporting materials. Refer to Supplementary Table 2 for specific tools relevant to many of these areas. For each application, a basic data recommendation is provided. It is important to remember that these are simply examples. In addition to the varying demands of each analysis technique, data requirements will depend heavily on the size and complexity of the genome, the complexity of the transcriptome, the method of RNA isolation and library preparation, the need to robustly detect transcripts with low copy numbers, and many other factors. For the purposes of this table, low RNA-seq depth is 5-25M reads, moderate depth is 25-100M reads, and high depth is 100-500M reads. Similarly, short reads are 50-200bp and long reads are 200-500bp.
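
The depth and read length categories defined above can be written as a small Python lookup. This is a sketch only: the function names are hypothetical, and because the stated ranges share their endpoints (25M, 100M, 200bp), the code assumes that a boundary value belongs to the higher category.

```python
def classify_depth(total_reads: int) -> str:
    """Map a library's read count to the depth category used in this table."""
    millions = total_reads / 1_000_000
    if millions < 5:
        return "below low"
    if millions < 25:
        return "low"
    if millions < 100:   # assumption: exactly 25M counts as moderate
        return "moderate"
    if millions <= 500:  # assumption: exactly 100M counts as high
        return "high"
    return "above high"


def classify_read_length(read_length_bp: int) -> str:
    """Map a read length in bp to the short/long categories used in this table."""
    if read_length_bp < 50:
        return "below short"
    if read_length_bp < 200:
        return "short"
    if read_length_bp <= 500:  # assumption: exactly 200bp counts as long
        return "long"
    return "above long"


depth_label = classify_depth(30_000_000)
length_label = classify_read_length(100)
```

For example, a 30M read library of 100bp reads would be labeled moderate depth with short reads under these assumptions.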
3 |
4 | | RNA-seq analysis technique | Description |
5 | |--------------------------|:------------|
6 | | **Gene annotation and transcript discovery** [19592507, 20436464, 20935650, 21572440, 25608678, 19087247, 21623353] | RNA-seq produces short reads (often paired) from short (~100-500 bp) fragments of cDNA by shotgun sequencing. When performed to high depth for a single sample it is possible to make strong inferences regarding the regions of transcriptional activity within a genome and the exon-intron structure of transcripts expressed from each region. When a reference genome sequence is available, RNA-seq reads can be aligned to this sequence using a splice aware aligner and the intron-exon boundaries and exon-exon connections of expressed transcripts can be determined. More generally, the gene loci where expression occurs can be enumerated. The presence of multiple transcript isoforms expressed from a single locus can also be determined in many cases, though the full length structure may be difficult to completely resolve. In addition to aligning reads to a reference genome sequence, transcripts can also be inferred by de novo transcript assembly. If a reference genome sequence is available, the resulting assembled contigs can be aligned and used to determine exon-intron structure. If no reference genome sequence is available, the resulting contigs can still be used for gene annotation by analysis with ORF finding tools, sequence conservation comparisons to related species, etc. Data recommendation: This application will benefit from longer reads, especially in species with large introns, small exons, and complex splicing patterns. Sequence depth will influence the comprehensiveness of transcripts that can be annotated, with lowly expressed transcripts requiring perhaps 100M reads or more to effectively cover. |
7 | | **Gene and transcript expression estimation** [24109770, 24185837, 24685233, 24885830] | In the ‘gene annotation and transcript discovery’ discussion above we described the use of RNA-seq to determine which gene loci are transcribed by RNA polymerase and how the exon-intron structure of those transcripts is determined by the splicing machinery. Gene and transcript expression estimation by RNA-seq involves the abundance estimation of all transcripts or individual transcript isoforms expressed from each locus. This step often relies on existing transcript annotations for a species (e.g., a transcriptome GTF file from Ensembl) or it requires that you first predict the transcripts present in your data and then derive abundance estimates for those transcripts. Cufflinks and HTSeq are two examples of tools used to estimate transcript and gene abundances. To estimate abundance by RNA-seq, reads are aligned either to a reference genome, reference transcript sequences, assembled transcript contigs derived from the same data, or some combination of these. The read count observed at each locus or for each known transcript sequence is then used to estimate the relative abundance of each transcript. If a spike-in reagent with known concentrations was used during library construction, it may be possible to estimate absolute copy number values in the sample. Abundance estimation tools may attempt to normalize expression values to account for biases related to the different sizes of transcripts, varying GC content, varying library sequence depth, and so on. Further normalization across a series of samples may involve examination of a set of ‘housekeeping genes’ and/or use of other data normalization techniques [22988256]. Data recommendation: This application places one of the lowest demands on library depth compared to other RNA-seq analysis techniques. 
For gene-level expression estimation, as few as 5-10M reads may be sufficient for mRNA-seq libraries and where possible additional replicates may be preferable to deeper individual libraries. This application will work well with long or short reads that may or may not be paired end. |
8 | | **Differential gene or transcript expression analysis** [22383036, 21903743, 21176179] | Differential gene or transcript expression involves comparison of abundance estimates between two or more conditions. For example, the abundance of a gene observed in different tissues, developmental stages, chemical exposures, disease versus healthy states, etc. There are many biases that influence abundance estimates for each gene or transcript. One advantage of differential expression analysis is that many of these biases will be consistent across the samples and ‘cancel out’ leaving potentially biologically relevant differences in gene expression. Unfortunately, many factors may introduce systematic bias that is not equal across the RNA-seq data sets being studied. There are many approaches for identifying batch effects and for performing data normalization that may mitigate the effect of these systematic biases. Data recommendation: This application has close to the same data requirements as expression estimation except that some additional sequence depth may be required to accurately estimate subtle differences in expression between samples. |
9 | | **Alternative expression (alternative transcript initiation, polyadenylation, and splicing) analysis** [20835245] | Alternative expression is closely related to differential expression but attempts to identify differences in the relative ratios of alternative isoforms expressed from a locus. It is possible for the overall expression output from a locus to remain unchanged between two conditions but have a significant shift in the relative expression levels of alternative isoforms. Alternative expression can be caused by changes in the use of alternative transcript initiation sites, exon-intron splice sites, and polyadenylation sites at a locus. Many human protein-coding loci have extensive potential for alternative expression and the majority of human loci have at least one known alternative isoform. Subtle changes in the structure of transcripts may have pronounced functional consequences but have relatively subtle effects on transcript or gene abundance estimates. Alternative expression analysis by RNA-seq has the potential for a more nuanced representation of the transcriptional state of a sample compared to simple gene expression analysis, though it comes at the cost of more complicated algorithms. Data recommendation: This application will place some of the highest demands on library depth and read length. To robustly assay the alternative expression patterns of human tissues we recommend at least 300-500M reads. This application will also benefit from longer reads. |
10 | | **Allele specific expression (ASE) analysis** [20567245, 21811232] | In diploid (or polyploid) species RNA expression can occur independently from each inherited chromosome. Maternal and paternal derived alleles of each gene locus may contain sequence differences such as common polymorphisms (e.g., SNPs) and mutations. They also differ in their methylation (e.g., imprinting) or other epigenetic states. While many gene loci exhibit balanced expression from each allele, some loci exhibit unbalanced or allele specific expression patterns [20567245]. This allele specific bias could be caused for example by a polymorphism near a promoter that increases transcription factor recruitment and increased polymerase activity for one allele compared to another. The same kinds of allele specific effects can influence choice of alternative transcript initiation sites, alternative splice sites, and alternative polyadenylation sites. Mutations can result in completely novel expressed isoforms being generated from the mutated allele. Allele specific expression analysis uses the presence of known heterozygous polymorphisms within the expressed portion of genes to observe the balance/imbalance of expression from both alleles. In order to perform allele specific expression analysis it is desirable to accurately identify these heterozygous sites in the individual being studied. One therefore typically needs both DNA sequence (DNA-seq) (e.g., WGS or Exome) data as well as RNA-seq data for each sample to be analyzed for allele specific expression. Data recommendation: This application has moderate demands for library depth compared to other RNA-seq analyses. Getting accurate variant allele frequencies (VAFs) will require low library sequence depth for highly expressed genes but high library sequence depth for lowly expressed genes. Measuring allele specific expression for single nucleotide variants will work well with 100 bp reads (or perhaps shorter). 
Measuring allele specific expression for insertions and deletions (especially >10-20 bp) will benefit from longer read length libraries. |
11 | | **RNA editing analysis** [21960545, 22327324, 22955975] | RNA editing describes nucleotide sequence modifications to RNA molecules that happen after transcription by an RNA polymerase. Such modifications result in apparent changes in the RNA sequence from that which would be predicted from the genome sequence. It is possible to detect such sequence changes in RNA-seq data, but in order to be convinced that the change is due to RNA editing, DNA-seq data is required for the same sample. In simple terms, by comparing the transcribed sequence by RNA-seq to the genome sequence by DNA-seq (WGS or Exome) one can infer that RNA editing has taken place at the RNA level. However, due to sequence errors, mapping artifacts and other sources of systematic biases, care must be taken to distinguish false positives from true RNA editing events and the prevalence of RNA editing as determined by RNA-seq analysis remains a controversial area of research. Data recommendation: This application has moderate demands for library depth, similar to those for allele specific expression. However, since RNA-edits consist primarily of single nucleotide changes, longer read lengths are a lower priority for this application compared to allele specific expression analysis.|
12 | | **Variant detection (variant discovery)** [23555596, 24075185, 22468815] | While variant detection typically involves analysis of DNA-seq data such as WGS or exome data [23341494], it is also possible to perform detection of single nucleotide variants and small insertions or deletions using RNA-seq data [23555596, 24075185, 22468815]. RNA-seq variant detection involves alignment of RNA-seq reads to a reference genome sequence or database of reference transcript sequences followed by scanning the resulting alignments for sites that exhibit sequence base differences relative to the reference sequences. The proportion of reads harboring the variant sequence is used to calculate a variant allele frequency (VAF) from 0 to 100%. The number of variant supporting reads, VAF, base qualities at the variant position, read alignment qualities, overall level of coverage, and other factors collectively influence the confidence of each variant prediction (i.e. the probability that it is a real variant and not a false positive). There are several factors that complicate this variant detection when performed with RNA-seq instead of DNA-seq data. In eukaryotic species, the presence of introns complicates alignment of reads to a reference genome and may lead to alignment errors. In some cases, these errors may result in false positive variant calls where repeated alignment errors result in systematic mismatches. These false positives are enriched near the edges of exons when performing variant discovery with RNA-seq data because correctly resolving exon-intron-exon alignments is difficult, especially with large introns and where only a short portion of a read spans from one exon to the next. Reads that mostly align to one exon but spill over the edge of it can result in misaligned bases. False negatives may also occur in regions of the genome that are difficult to map to, and these alignment holes may be more prevalent where exon-intron structures complicate alignment. 
Some of these alignment issues may be overcome by aligning reads directly to predicted transcript sequences and performing variant detection by observing sequence differences between the known transcript sequence and aligned reads. Library end bias and corresponding lack of coverage near the 5’ end of transcripts may also result in false negatives due to poor coverage. Furthermore, detection of polymorphisms will be limited to genes that are expressed in the tissue being profiled and the ability to call variants within expressed genes will vary across the range of expression levels. In highly expressed genes, coverage may be extremely high. This can lead to detection of false positives at low variant allele frequency if the variant detector does not use appropriate statistics. In genes that are not expressed, variant detection will not be possible. Some mutations within exons may lead to nonsense mediated decay (NMD) that results in decreased stability of the mutant harboring transcripts. This will reduce the ability to detect such loss of function events when using RNA-seq data alone. Data recommendation: This application has moderate to high demands on library depth and read length depending on the specific type of variant detection as outlined in the following four entries of this table. |
13 | | **Common polymorphism detection** [23555596, 24075185, 22468815] | It is possible by RNA-seq analysis to detect common polymorphisms (e.g., SNPs) that occur within expressed exons [23555596, 24075185]. The sites of many of these are known in many species and this knowledge can be used to guide their detection. Since the expected frequency of heterozygous and homozygous polymorphisms is high (~50% and ~100% respectively) they can be readily detected even in genes with low expression levels and therefore low read coverage. As discussed above, allele specific expression may reduce or increase the expected frequency of heterozygous SNPs. Since the majority of common polymorphisms occur within introns or outside of gene loci, a relatively narrow subset of polymorphisms will be assayed by RNA-seq data alone. Data recommendation: This application has moderate demands on library depth. Since the variants are expected to occur at 50 or 100% VAF, detecting them should be possible with 20-30x coverage at each site. Target library depth will be driven by the amount of data needed to achieve this coverage for lowly expressed genes. As described for allele specific expression above, variants with substantial nucleotide differences from the reference genome sequence (e.g., insertions and deletions) may benefit from longer read lengths to facilitate accurate alignment of reads containing the variant sequence. |
14 | | **Germline mutation detection** [23555596, 24075185, 22468815] | RNA-seq analysis for germline mutation detection is largely equivalent to the detection of polymorphisms as described above except the variants being discovered are very rare in the population (they may even be private to a single individual). Without prior knowledge of the expected site of mutation the analysis must scan the entire transcriptome. Such analysis may be greatly aided by having RNA-seq data from related family members (e.g., a trio of mother, father, child). As with polymorphism detection, mutation detection in RNA-seq data will be complicated by the varying expression levels of each gene and allele specific expression. Data recommendation: This application has essentially the same data needs as common polymorphism detection described above. |
15 | | **Somatic mutation detection** [23555596, 24075185, 22468815] | Somatic mutation detection has many similarities to other types of variant detection described above. It still involves detection of variants but adds an extra consideration to identify the subset of variants that were likely acquired in the DNA of the tumor (i.e. those that are not germline inherited variants). Somatic mutation detection is possible but difficult with RNA-seq data compared to DNA-seq data such as WGS or exome data. Using DNA-seq data it is common to compare tumor sequence data directly to matched normal data to assess the somatic status of variants. The normal DNA sample is usually blood in the case of solid tumors, and usually a skin biopsy in the case of hematologic tumors. Since we expect approximately even coverage across the genome (or exome) for both the tumor and normal sample, we can compare DNA-seq reads at each position harboring a variant in the tumor data and assess its presence in the normal data. Convincing somatic variant sites will have good sequence coverage in both the normal and tumor sample but will only have significant support for the variant base in the tumor data. This kind of sample pairing for tumor/normal comparison is not usually appropriate for RNA-seq data. Using a blood normal RNA sample as a comparator for a solid tumor would not work well because the gene expression pattern for the solid tumor would not be expected to match that of the blood sample. In other words we often may not have coverage of variant sites in both normal and tumor. Furthermore, there may be differences in allele specific expression between the tumor and normal comparator. For some tumor types, it may be possible to obtain a tissue-matched normal sample to use as a comparator for determining somatic status. For example, a breast tumor sample could be compared to adjacent normal breast tissue obtained from the same individual. 
However, even in such cases, the matched normal may not have the same composition of cell types that the tumor has and there may be significant differences in the transcriptome landscape between tumor and normal samples that confound somatic variant determination. One strategy that could be used to circumvent this challenge is to compare the tumor RNA-seq data to normal DNA-seq data such as exome data. Given the decreasing cost of WGS and exome data it is probably more appropriate to simply produce RNA-seq data for the tumor and DNA-seq data for both the tumor and a matched normal. Data recommendation: This application is similar to other variant detection types but has substantially increased demands on library sequence depth compared to other categories because it involves the detection of somatic variants in tumor samples that may be contaminated with normal DNA (thereby reducing the observable VAF and number of variant supporting reads) or confounded by tumor heterogeneity (where some mutations exist only in subclonal populations). If a normal RNA sample is used as a comparator to determine the somatic status (often not possible), good coverage of that sample will also be required. As with other variant detection types, characterization of complex variants may benefit from longer reads. |
16 | | **Mutation expression assessment** [24752137] | Perhaps the most common application of RNA-seq data in the sphere of mutation detection is to first detect all mutations (germline or somatic) using DNA-seq data and then only use the RNA-seq data to assess the expression status of each mutation [24752137]. This is equivalent to the allele specific expression analysis described above except that instead of relying on known sites of polymorphism common in the population it relies on a prior mutation detection step using DNA-seq data. The ability to assess the expression status of mutations varies by the complexity of the mutation. Single nucleotide variants (SNVs) will be relatively straightforward but larger insertions and deletions may be more challenging because they are harder to align. In general, one should be careful to remember that the failure to confirm expression of a mutation observed at the genome level in the transcriptome could be influenced by differences in alignment between the DNA-seq and RNA-seq at the site of the mutation as well as other RNA-seq specific biases such as reduced RNA-seq coverage at the 5’ ends of transcripts. Data recommendation: This application has perhaps the lowest demands on library depth compared to other variant detection applications since the variants are already detected at the DNA-level and RNA-seq data is only used to assess their expression level. However, these variants may occur anywhere in a transcript and might occur in transcripts with low but functionally significant expression levels. In other words, comprehensive and deep coverage of the transcriptome is still desirable for this application. As with other variant detection types, characterization of complex variants may benefit from longer reads. |
17 | | **Gene fusion detection** [23555082, 23815381, 25266161, 25286921, 25500544] | Gene fusion detection by RNA-seq is mostly performed in the context of tumor sequencing projects [23555082, 22877769, 24320890]. A gene fusion is a chimeric transcript that combines portions of two transcripts normally expressed from two distinct gene loci, ‘gene A’ and ‘gene B’ (e.g., BCR-ABL, EML4-ALK, etc.). Gene fusions may arise as a consequence of structural variations in the genome such as deletions, insertions, inversions, and translocations. Identification of fusion events in RNA-seq data relies on two main forms of alignment information. (1) Paired-end read information where one read of a pair maps to ‘gene A’ and the other read of that pair maps to ‘gene B’. Such reads are sometimes referred to as encompassing reads. (2) Individual reads that align across the junction of ‘gene A’ and ‘gene B’. For example, a read where the first half maps to the edge of an exon in ‘gene A’ and the second half of this read maps to the edge of an exon in ‘gene B’. Such reads are sometimes referred to as spanning reads. Drops or spikes in read coverage levels across the length of either ‘gene A’ or ‘gene B’ that correspond to the apparent breakpoint may also help to support the existence of a gene fusion. In some cases, reads that align only partially, with the unaligned portion ‘soft clipped’ or appearing as a stretch of mismatches, may suggest the presence of a fusion breakpoint. Fusion detection tools currently under development attempt to combine evidence from both the RNA and DNA level to give more accurate predictions. 
Gene fusion detection tools generally have complex processes for producing alignments suitable for fusion detection, filtering steps to remove false positives that occur widely in genes with paralogs, an assembly step that attempts to determine the fusion sequence, annotation steps that attempt to determine if an in frame fusion product is likely to result from an RNA fusion transcript, and additional annotation steps. Data recommendation: This application has moderate demands on library depth assuming that an oncogenic fusion gene is likely to be expressed at reasonably high levels. Reads should be paired-end as most fusion detection tools assume pairing information will be present. Medium to long reads are desirable to ensure accurate alignment of reads to genes with many paralogs or pseudogenes and also to allow accurate mapping of reads that span across fusion breakpoints that may involve any two points in the genome.|
18 | | **Viral detection** [22647373, 23279287, 23740984] | Expression of some viruses may be detected and their genome characterized by RNA-seq [22647373, 19394993]. In some human tumors, certain viruses may be present either as endogenous elements within the cell or integrated into the genome [23740984, 24085110]. In either case it may be possible to detect expression of viral transcripts in RNA-seq libraries generated from these cells. Detection of viral sequences may involve inclusion of certain viral reference sequences (e.g., HPV, HBV, HCV, EBV, etc.) in the reference genome sequence database to which all RNA-seq reads are aligned. Another strategy is to obtain only those reads that do not align to the reference genome sequence for the species being studied and attempt to align these reads to a database of viral sequences. Some strategies further involve de novo assembly of these reads into contigs prior to alignment to viral sequence databases. Data recommendation: This application requires moderate to high library depth. Detecting sequences that align to viral genomes or distinctive viral k-mers may not require a very deep library if the virus is actively expressing RNAs in the tissue sampled. Identifying fusion sequences involving viruses has many of the same complexities as normal fusion detection and likewise may benefit from longer read lengths. |
19 |
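The second viral detection strategy described in the table above (keeping only the reads that fail to align to the host reference) can be sketched as a filter over SAM records. This is a minimal illustration under stated assumptions, not the method of any tool listed in these tables: it assumes uncompressed SAM text as input, relies only on the SAM FLAG bit 0x4 (segment unmapped), ignores mate pairing, and the function name `unmapped_reads_to_fastq` is hypothetical.

```python
from typing import Iterable, Iterator

UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_reads_to_fastq(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTQ record for every unmapped read in SAM text.

    Hypothetical helper: header lines (starting with '@') and records
    with fewer than the 11 mandatory SAM fields are skipped.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not an alignment record
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # malformed record
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & UNMAPPED:
            yield f"@{name}\n{seq}\n+\n{qual}\n"


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.6\n",
        "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",  # mapped
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tFFFF\n",         # unmapped
    ]
    for record in unmapped_reads_to_fastq(example):
        print(record, end="")
```

The resulting FASTQ records could then be aligned to a viral sequence database or assembled de novo, as the table describes. In practice an established SAM/BAM toolkit would normally perform this step.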
20 |
--------------------------------------------------------------------------------
/manuscript/supplementary_tables/supplementary_table_2.md:
--------------------------------------------------------------------------------
1 | ###Supplementary Table 2. Tools for RNA-seq analysis
2 | All tools used in the online tutorial (www.rnaseq.wiki) are referenced below (in bold) along with alternative tools in each category. Where possible a citation is provided. Links are also provided to help the user evaluate the code and the level of maintenance. Where possible the link goes directly to a source controlled repository such as a git repo. Additional lists of tools can be found here: Alamancos et al. (arXiv), Hooper et al. [24447644], the rna-seqblog, and RNA-seq - Protocols and Algorithms. This table is meant to be comprehensive but not exhaustive. Some RNA-seq analysis applications that are not explicitly covered here include co-regulation (co-expression), disease classification, time series, expression compendium databases, outlier expression, data normalization, and miRNA analysis.
3 |
4 | | Category | Representative tools |
5 | |----------|:---------------------|
6 | | **Raw data QC** [25577376, 25150838] | FastQC, HTQC [23363224], QC3 [24703969], kPAL [25514851]. |
7 | | **Read trimming** [24376861] | Trimmomatic [24695404], Skewer [24925680], Flexbar [24832523], FASTX. |
8 | | **Alignment (splice aware, for alignment to a reference genome)** [24185836] | TopHat [19289445, 23618408], STAR [23104886], HISAT [25751142], segemehl [24512684], GSNAP, MapSplice [20802226], JAGuaR [25062255], SpliceMap [20371516], HMMSplicer [21079731], TrueSight/UnSplicer [24259430]. |
9 | | **Alignment (non splice aware for alignment to a reference transcriptome)** [23060614, 23758764] | BowTie [19261174], Bwa [19451168]. |
10 | | **Post-alignment QC** [24185836] | FastQC, samtools [19505943], QuaCRS [25368506], RSeQC [22743226], RNA-SeQC [22539670], Picard CollectRnaSeqMetrics, BAMstats, SAMstat [21088025], BlackOPs [23935067], seqbias [22285831]. |
11 | | **Gene/transcriptome annotation** [24722185, 25319663] | Annocript [25701574], XSAnno [24884593], GeneMark-ET [24990371], WImpiBLAST [24979410], RNASEG [24780064], TSSAR [24674136], Vicinal [24623808], OMIGA [24609470], CoRAL [24145223], AfterParty [24093729], ShortStack [23610128], CIRI [25583365]. |
12 | | **Small RNA identification and characterization (e.g., miRNAs)** [25319663, 23720668] | ShortStack [23610128], CoRAL [24145223], MTide [25256573], FlaiMapper [25338717], miRPlant [25117656], PROmiRNA [23958307], omiRas [23946503], DREAM [25840043]. |
13 | | **Transcript assembly (reference genome guided)** [24185837, 21897427, 23393030] | Cufflinks [20436464], Scripture [20436462], StringTie [25690850], bayesembler [25367074], IsoLasso [21951053]. |
14 | | **Transcript assembly (de novo, reference genome free)** [21897427, 23393030, 23056003, 23666209, 25084827, 25279728, 25788326] | Trinity [23845962], Trans-ABySS [20935650], Oases [22368243], RSEM [21816040], DETONATE [25608678], SEECER (sequencing error correction for assembly) [23558750], BRANCH [23493323] uses partial or related genomics sequences as a guide, EBARDenovo [23457040], Bridger [25723335]. |
15 | | **Transcript abundance or expression estimation (FPKM/RPKM)** [24185837, 24885830, 24109770, 24685233] | Cufflinks [20436464], eXpress [23160280], RSEM [21816040], Sailfish (alignment free) [24752080], RNA-Skim (alignment free) [24931995], MITIE [23980025], ireckon [23204306], DRUT [23202426]. |
16 | | **Obtaining raw transcript/gene read counts (FPM/RPM)** [21176179] | HTSeq [25260700], FeatureCounts [24227677], Rcount [25322836], maxcounts [24564404], FIXSEQ (adjusts counts to compensate for overdispersion) [24603409], Cuffquant. |
17 | | **Differential expression** [25119138, 24300110, 24020486, 25024085] | Cuffdiff [23222703], limma [25605792], DESeq2 [25516281], EdgeR [19910308], Corset (for de novo assembled transcriptomes) [25063469], sSeq [23589650], BADGE [25252852], compcodeR [24813215], metaRNASeq [24678608], Characteristic Direction [24650281], NPEBseq [23981227]. |
18 | | **Alternative splicing, alternative expression** [24447644, 24885830, 24058384, 24549677, 24951248, 25511303] | Cuffdiff [23222703], DEXSeq [22722343], ALEXA-seq [20835245], IUTA [25283306], FineSplice [24574529], PennSeq [24362841], FlipFlop [24813214], SNPlice [25481010], spliceR [24655717], GESS [24447644], RNASeq-MATS [23872975], SplicingCompass [23449093], DiffSplice [23155066], SigFuge [25030904], SUPPA [bioRXiv], CLASS [bioRXiv], SplAdder [bioRXiv], SplicePie [25800735]. |
19 | | **Variant (e.g., SNP) and mutation detection** [23555596, 24075185, 22468815], germline or somatic, and eQTL/sQTL characterization [25733796] | GATK (Best Practices Guide) [20644199], samtools [19505943], SNVMix [20130035], SNPlice [25481010], eSNV-detect [25352556], RVboost [25170027], sQTLseekeR [25140736], eQTL/ASE, [BioRXiv], SNiPloid [24163691], SNPiR [24075185], QualitySNPng [23632165], RNAmapper [23299976], CRAC [23537109], RADIA [25405470]. |
20 | | **RNA editing** [22327324, 23291724, 23598527, 25859542] | REDItools [23742983], GIREMI [25730491], ICEBreaker [25855956]. |
21 | | **Allele specific expression** [23919664, 25183311, 25339465] | AlleleSeq [21811232], Allim [23615333], mamba [25819081], EMASE, MBASED [25315065], limma [25605792]. |
22 | | **Viral detection** [23740984, 23279287, 22647373] | VirusSeq [23162058], VirusFinder [23717618], RNA CoMPASS [24586784]. |
23 | | **Fusion detection** [25500544, 25266161, 23815381, 23555082, 25286921] | FusionQ [23815381], TRUP [25650807], Dissect [22689759], Trans-ABySS [20935650], PRADA (RNA-seq pipeline with a fusion module) [24695405], Pegasus (used for fusion annotation) [25183062], FusionCatcher, ChimeraScan [21840877], TopHat-fusion [21835007], BreakFusion [22563071], deFuse [21625565], FusionHunter [21546395], EricScript [23093608], Barnacle [23941359], bellerophontes [22711792], Chimera (merge results from multiple fusion algorithms) [25286921], GFML (format for representing fusion data) [23072312]. |
24 | | **Visualization** [24792048, 25757788] | SplicingViewer [22226708], IGV [22517427], Sashimi plots [25617416], IGB (splicing visualization protocol) [24792048], PrimerSeq (Visualize RNA-seq data for primer design) [24747190], ASTALAVISTA [25577392], Circos [19541911], Epiviz [25086505], RNAbrowse [24823498], ZENBU [24727769], RNAseqViewer [24215023], viRome [23709497], miRseqViewer [25322835], Circleator [25075113], RNASeqBrowser [25766521]. |
25 | | **Integration of DNA-seq and RNA-seq data** [23499923] | Veridical [24741438], SpliceFinder [24498620], nFuse [22745232], RADIA [25405470]. |
26 |
27 |
28 |
29 |
30 |
31 |
--------------------------------------------------------------------------------
/manuscript/supplementary_tables/supplementary_table_2_urls.md:
--------------------------------------------------------------------------------
1 | ###Supplementary Table 2. Tools for RNA-seq analysis
2 | All tools used in the online tutorial (www.rnaseq.wiki) are referenced below (in bold) along with alternative tools in each category. Where possible a citation is provided. Links are also provided to help the user evaluate the code and the level of maintenance. Where possible the link goes directly to a source controlled repository such as a git repo. Additional lists of tools can be found here: Alamancos et al. (arXiv), Hooper et al. [24447644], the rna-seqblog, and RNA-seq - Protocols and Algorithms. This table is meant to be comprehensive but not exhaustive. Some RNA-seq analysis applications that are not explicitly covered here include co-regulation (co-expression), disease classification, time series, expression compendium databases, outlier expression, data normalization, and miRNA analysis.
3 |
4 | | Category | Representative tools |
5 | |----------|:---------------------|
6 | | **Raw data QC** [25577376, 25150838] | FastQC, HTQC [23363224], QC3 [24703969], kPAL [25514851]. |
7 | | **Read trimming** [24376861] | Trimmomatic [24695404], Skewer [24925680], Flexbar [24832523], FASTX. |
8 | | **Alignment (splice aware, for alignment to a reference genome)** [24185836] | TopHat [19289445, 23618408], STAR [23104886], HISAT [25751142], HISAT2, segemehl [24512684], SubRead, GSNAP, MapSplice [20802226], JAGuaR [25062255], SpliceMap [20371516], HMMSplicer [21079731], TrueSight/UnSplicer [24259430]. |
9 | | **Alignment (non splice aware for alignment to a reference transcriptome)** [23060614, 23758764] | BowTie [19261174], Bwa [19451168]. |
10 | | **Post-alignment QC** [24185836] | FastQC, samtools [19505943], QuaCRS [25368506], RSeQC [22743226], RNA-SeQC [22539670], Picard CollectRnaSeqMetrics, BAMstats, SAMstat [21088025], BlackOPs [23935067], seqbias [22285831]. |
11 | | **Gene/transcriptome annotation** [24722185, 25319663] | Annocript [25701574], XSAnno [24884593], GeneMark-ET [24990371], WImpiBLAST [24979410], RNASEG [24780064], TSSAR [24674136], Vicinal [24623808], OMIGA [24609470], CoRAL [24145223], AfterParty [24093729], ShortStack [23610128], CIRI [25583365]. |
12 | | **Small RNA identification and characterization (e.g., miRNAs)** [25319663, 23720668] | ShortStack [23610128], CoRAL [24145223], MTide [25256573], FlaiMapper [25338717], miRPlant [25117656], PROmiRNA [23958307], omiRas [23946503], DREAM [25840043]. |
13 | | **Transcript assembly (reference genome guided)** [24185837, 21897427, 23393030] | Cufflinks [20436464], Scripture [20436462], StringTie [25690850], bayesembler [25367074], IsoLasso [21951053]. |
14 | | **Transcript assembly (de novo, reference genome free)** [21897427, 23393030, 23056003, 23666209, 25084827, 25279728, 25788326] | Trinity [23845962], Trans-ABySS [20935650], Oases [22368243], RSEM [21816040], DETONATE [25608678], SEECER (sequencing error correction for assembly) [23558750], BRANCH [23493323] uses partial or related genomics sequences as a guide, EBARDenovo [23457040], Bridger [25723335]. |
15 | | **Transcript abundance or expression estimation (FPKM/RPKM)** [24185837, 24885830, 24109770, 24685233] | Cufflinks [20436464], eXpress [23160280], RSEM [21816040], Sailfish (alignment free) [24752080], RNA-Skim (alignment free) [24931995], MITIE [23980025], ireckon [23204306], DRUT [23202426], Kallisto (alignment free) [arXiv]. |
16 | | **Obtaining raw transcript/gene read counts (FPM/RPM)** [21176179] | HTSeq [25260700], FeatureCounts [24227677], Rcount [25322836], maxcounts [24564404], FIXSEQ (adjusts counts to compensate for overdispersion) [24603409], Cuffquant. |
17 | | **Differential expression** [25119138, 24300110, 24020486, 25024085] | Cuffdiff [23222703], limma [25605792], DESeq2 [25516281], EdgeR [19910308], Corset (for de novo assembled transcriptomes) [25063469], sSeq [23589650], BADGE [25252852], compcodeR [24813215], metaRNASeq [24678608], Characteristic Direction [24650281], NPEBseq [23981227]. |
18 | | **Alternative splicing, alternative expression** [24447644, 24885830, 24058384, 24549677, 24951248, 25511303] | Cuffdiff [23222703], DEXSeq [22722343], ALEXA-seq [20835245], IUTA [25283306], FineSplice [24574529], PennSeq [24362841], FlipFlop [24813214], SNPlice [25481010], spliceR [24655717], GESS [24447644], RNASeq-MATS [23872975], SplicingCompass [23449093], DiffSplice [23155066], SigFuge [25030904], SUPPA [bioRXiv], CLASS [bioRXiv], SplAdder [bioRXiv], SplicePie [25800735]. |
19 | | **Variant (e.g., SNP) and mutation detection** [23555596, 24075185, 22468815], germline or somatic, and eQTL/sQTL characterization [25733796] | GATK (Best Practices Guide) [20644199], samtools [19505943], SNVMix [20130035], SNPlice [25481010], eSNV-detect [25352556], RVboost [25170027], sQTLseekeR [25140736], eQTL/ASE, [BioRXiv], SNiPloid [24163691], SNPiR [24075185], QualitySNPng [23632165], RNAmapper [23299976], CRAC [23537109], RADIA [25405470]. |
20 | | **RNA editing** [22327324, 23291724, 23598527, 25859542] | REDItools [23742983], GIREMI [25730491], ICEBreaker [25855956]. |
21 | | **Allele specific expression** [23919664, 25183311, 25339465] | AlleleSeq [21811232], Allim [23615333], mamba [25819081], EMASE, MBASED [25315065], limma [25605792]. |
22 | | **Viral detection** [23740984, 23279287, 22647373] | VirusSeq [23162058], VirusFinder [23717618], RNA CoMPASS [24586784]. |
23 | | **Fusion detection** [25500544, 25266161, 23815381, 23555082, 25286921] | FusionQ [23815381], TRUP [25650807], Dissect [22689759], Trans-ABySS [20935650], PRADA (RNA-seq pipeline with a fusion module) [24695405], Pegasus (used for fusion annotation) [25183062], FusionCatcher, ChimeraScan [21840877], TopHat-fusion [21835007], BreakFusion [22563071], deFuse [21625565], FusionHunter [21546395], EricScript [23093608], Barnacle [23941359], bellerophontes [22711792], Chimera (merge results from multiple fusion algorithms) [25286921], GFML (format for representing fusion data) [23072312]. |
24 | | **Visualization** [24792048, 25757788] | SplicingViewer [22226708], IGV [22517427], Sashimi plots [25617416], IGB (splicing visualization protocol) [24792048], PrimerSeq (Visualize RNA-seq data for primer design) [24747190], ASTALAVISTA [25577392], Circos [19541911], Epiviz [25086505], RNAbrowse [24823498], ZENBU [24727769], RNAseqViewer [24215023], viRome [23709497], miRseqViewer [25322835], Circleator [25075113], RNASeqBrowser [25766521]. |
25 | | **Integration of DNA-seq and RNA-seq data** [23499923] | Veridical [24741438], SpliceFinder [24498620], nFuse [22745232], RADIA [25405470]. |
26 |
27 |
28 |
29 |
30 |
31 |
--------------------------------------------------------------------------------
/manuscript/supplementary_tables/supplementary_table_3.md:
--------------------------------------------------------------------------------
1 | ###Supplementary Table 3. Concepts in sample preparation and library construction that can influence study design, analysis and interpretation
2 | The following table summarizes several key concepts relating to sample preparation and library construction that may influence analysis and interpretation of RNA-seq data. Several initiatives are underway to develop standards and best practices that cover many of these concepts. These include: the Sequencing Quality Control (SEQC) consortium (PMID: 25150838), the Encyclopedia of DNA Elements (ENCODE) consortium, the Roadmap Epigenomics Mapping Consortium (REMC), and the Beta Cell Biology Consortium (BCBC).
3 |
4 | | Strategy/concept | Relevance to RNA-seq analysis and data interpretation |
5 | |--------------------------|-------------------------------------------------------|
6 | | RNA integrity/degradation | RNA is susceptible to degradation, much more so than DNA. RNA degradation can significantly impact library complexity, alignment (PMID: 25339126), transcript quantification (PMID: 24885439) and other RNA-seq applications. Degradation happens by various mechanisms, and several sample handling procedures and best practices are commonly employed to maintain intact RNA molecules in solution. For example, RNA isolation is recommended immediately upon obtaining cells. In cases where immediate RNA isolation is not possible, tissues may be stored in preserving agents meant to protect RNA until isolation is possible (e.g. ‘RNAlater’). To prevent the degrading activity of endogenous RNases, RNA isolation involves use of buffers that inhibit the activity of these enzymes. In addition to buffer conditions (e.g. pH), RNA isolation may involve use of RNase inhibitors. Similarly, immediate precipitation and removal of protein (especially RNases) from the sample reduces the risk of RNA degradation. Use of buffers containing chelating agents inhibits hydrolysis of RNA that can lead to strand cleavage. Best practices such as performing isolation at low temperatures (e.g. on ice) as well as maintaining clean conditions to prevent introduction of exogenous RNases are common. At the completion of RNA isolation and prior to sequencing library construction, RNA quality is routinely assayed by gel electrophoresis or capillary electrophoresis, for example on the Agilent 2100 Bioanalyzer, which provides a qualitatively interpretable ‘trace’ and a single RNA integrity number (RIN). RIN scores range from 1 to 10. The maximum score of 10 indicates intact RNA. The lower the score reported, the greater the level of degradation. Many researchers require that RNA isolated from fresh frozen material have a RIN of 6-8 or greater. If RNA is isolated from FFPE archival samples, the RIN will usually be much lower than this. 
For FFPE materials, alternate strategies may be used to evaluate RNA quality. For example, some researchers choose a ‘DV200’ cutoff. The DV200 metric describes the percentage of RNA fragments greater than 200 nucleotides in length (refer to this TechNote on TruSeq RNA Access for a more detailed discussion on assessing FFPE RNA quality). Once isolated, RNA is typically stored at -80℃ to inhibit degradation over time. If, despite all of these efforts, an RNA sample is degraded, this may result in small fragmented RNA and an RNA-seq library with short insert sizes. If fragments are too short, sequencing through the insert may result in a high rate of adapter sequencing. Degraded RNA samples should not be subjected to poly(A) selection to avoid introducing 3’ end bias (PMID: 25339126). If an RNA sample is sufficiently degraded, the fragmentation step during library construction may also be skipped. When creating libraries from heavily degraded material, the quality of the resulting library should also be carefully examined, for example by requiring a minimum concentration (e.g., 5 ng/µl) and by confirming that the insert size distribution shows the expected range of fragment sizes. Libraries made from heavily degraded RNA may require extra optimization during the cluster formation step of sequencing. |
7 | | Poly(A) selection versus total RNA versus ribosomal reduction (also known as ‘ribo-minus’ or ‘ribo-reduction’) | Prior to sequencing, total RNA must be isolated. Total RNA is dominated by ribosomal RNA (rRNA) sequences, comprising 95-98% of RNA molecules. If they are not efficiently removed prior to sequencing, rRNA reads will dominate the data output. Depending on experimental objectives, there are several options for reducing the proportion of rRNAs to allow sequencing of the rest of the transcriptome (PMID: 23685885). Two common strategies are poly(A) selection and ribo-reduction. Each has advantages and disadvantages (PMID: 24888378). In poly(A) selection, a solution of oligo(dT) probes is used to capture the poly(A) tail at the 3’ end of mature, processed mRNA.
In performing poly(A) selection, one is enriching for mature mRNA molecules, leaving behind the pre-processed mRNA as well as other non-coding RNA. In ribosomal reduction, oligonucleotides homologous to the ribosomal RNAs are used to capture ribosomal RNAs that are then removed, enriching for all other RNA species. This procedure will yield sequence reads for non-coding RNA, pre-processed RNA, and other functional RNA molecules like tRNAs. While these data tend to be noisier, they also give a broader representation of the transcript classes that make up the transcriptome. |
8 | | Fragmentation | RNA-seq involves sequencing of cDNA fragments that are usually ~250-450 nucleotides long. The average length of RNA molecules in many species is at least 5-10 times this size. Large RNAs must therefore be fragmented prior to sequencing and the full length structure of RNAs must be inferred during analysis by assembly of overlapping sequences. Fragmentation is performed directly on the RNA or after conversion to cDNA. RNA fragmentation may be achieved by an enzymatic process (e.g. RNases), a chemical process (e.g. exposure to metal ions), or a physical process (e.g. exposure to heat or shearing by sonication). cDNA fragmentation may similarly involve an enzymatic process (DNases), nebulization or sonication. To obtain a distribution of fragments in a specific size range, fragmentation is often followed by size selection. |
9 | | Size selection (narrow versus broad size selection versus small-RNA sequencing) | There are two size selection strategies for obtaining cDNA fragments of a size range suitable for RNA sequencing. In the first strategy, a tight size range may be selected (by polyacrylamide gel electrophoresis ‘PAGE’ for example) to produce a distribution with a small variance in size (known as a ‘tight’ distribution). This allows for efficient cluster formation on a flow cell, leading to a higher data yield from each run. It also allows algorithms downstream to infer more about the structure of RNAs based on any observed deviation from the expected insert size. A small size range reduces the number of possible unique fragments that can be generated from each RNA species and therefore could reduce overall library complexity and sequence content. In the second strategy, only small RNA species are removed using a simple column clean up that is more amenable to automation in the lab (PMID: 22973283). This leaves a much broader distribution with a long ‘tail’ of larger RNAs. During analysis, this strategy prevents strong inferences based on calculated insert sizes but the wider diversity of fragments may provide increased sequence complexity and may allow mapping in certain ambiguous regions that might otherwise be difficult to align to. Despite a wide range of sizes, the process of cluster formation and sequencing may be biased towards certain fragment sizes (likely smaller fragments) and therefore the observed size distribution in sequence reads may be shifted relative to estimates of fragment size obtained prior to sequencing. It should be noted that in both of the strategies described above, very small RNAs such as miRNAs are lost. These small RNA species are typically sequenced by an independent small RNA sequencing strategy that specifically targets RNA species in the ~20-150 bp range (or often a further subset of this range). |
10 | | Linear or exponential amplification of low-input samples | To allow for small amounts of input material, certain RNA-seq library construction strategies incorporate an up front linear or exponential amplification step. Examples of this type of strategy include: ‘Smart-seq’, ‘DP-seq’ and ‘CEL-seq’ (PMID: 23685885). The initial amplification strategy is in addition to the exponential PCR amplification that is a routine part of sequence library construction. Any amplification is potentially undesirable as it introduces biases that may mask subtle or even moderate biologically significant differences in RNA expression between conditions (PMID: 24419370). However, in the case of extremely low input, some amount of amplification may be required to allow RNA-seq library construction.
Linear amplification involves incorporation of an additional adapter sequence containing a promoter sequence that allows a polymerase (often T7 RNA polymerase) to generate copies. The high binding affinity of this enzyme for its promoter sequence is meant to minimize generation of artifactual products that distort expression measurements during analysis; however, this approach has been found to introduce considerable variability at low RNA input levels (PMID: 24419370). During analysis, an additional trimming step is required to remove these promoter sequences. When an initial amplification step is required, additional technical and biological replicates should be considered and greater emphasis placed on data QC during analysis. |
11 | | Library normalization | RNAs occur at varying abundances in a cell. These abundances can span as much as 10^5-10^7 fold (five to seven orders of magnitude) from the rarest to the most abundant transcripts (PMID: 18978789, 18978772). Since RNA-seq works by random sampling, a typical RNA-seq library is often dominated by reads from the most abundantly expressed genes. With respect to gene expression studies, this is arguably the correct outcome. In studies where measuring the abundance is not as critical as resolving the structures of RNA transcripts, annotating a new genome, or discovering novel RNA fusions, it may be desirable to normalize the library prior to sequencing. Library normalization in this context is any attempt to even out the abundance of transcripts such that the probability of obtaining reads from lowly expressed transcripts and highly expressed transcripts is more balanced. Several RNA-seq library normalization strategies have been proposed (PMID: 22988256). In a completely normalized library, the probability of obtaining reads from all expressed loci would be equal (after correcting for their varying sizes, biases related to GC content, etc.). Duplex-specific normalization (DSN) is one example of a normalization strategy used in RNA-seq library construction. It relies on use of a duplex-specific thermostable nuclease enzyme that preferentially cleaves DNA duplexes and DNA-RNA heteroduplexes. In this strategy, a sequencing fragment library is denatured and partially reannealed before addition of this enzyme. More abundant sequences reanneal more rapidly, and therefore are more heavily degraded by the enzyme, reducing their relative abundance in the library. Note that ‘library normalization’ described here should not be confused with ‘data normalization’ that seeks to enable accurate comparisons of expression levels between and within samples by adjusting for systematic biases in the data (i.e. adjusting expression estimates) (PMID: 22988256). 
However, differences in library normalization efficiency between libraries could be one source of bias that data normalization might address. |
12 | | Exome capture of RNA-seq libraries (and other attempts to recover low quality degraded RNA material) | One strategy that may be employed to normalize or ‘rescue’ RNA-seq libraries created from degraded RNA input is to subject them to exome capture (also known as ‘cDNA capture’). This approach improves the relative representation of lowly expressed transcripts and concentrates read coverage over the exons targeted by the capture array while reducing the proportion of reads aligning to intronic and intergenic regions. As with all normalization strategies, this approach could reduce the accuracy of expression estimates. On the other hand, for highly degraded samples (e.g. from FFPE material) it can substantially increase the quality of transcript assemblies compared to uncaptured data (PMID: 24814956). Another method found to be suitable for highly degraded FFPE material is the ‘RNase H’ method (PMID: 23685885). The ‘TruSeq RNA Access’ kit from Illumina is an example of a commercially available kit that implements the cDNA capture concept. |
13 | | Strand specific versus unstranded RNA-seq libraries | RNAs are transcribed by RNA polymerases in a 5’ to 3’ direction. For the most part, transcription occurs using only a single strand of the double stranded DNA template at any particular locus. However, there are significant portions of the genome where transcription in opposite directions overlaps at the beginning or ends of some genes. Furthermore, transcription of certain genes (e.g. miRNAs) may occur from within the intron of another gene on the opposite strand. In many early RNA-seq library construction strategies, knowledge of which strand had been transcribed was lost. These libraries are referred to as ‘un-stranded’ libraries. In these libraries we cannot definitively know which strand was being transcribed by RNA polymerase from the genomic DNA template. However, by comparing the position of a read and coverage pattern in that region to known transcript annotations we can often infer the likely direction/strand of transcription. Furthermore, for reads that span across exon-exon junctions, we can compare the observed splice site sequences to that expected for canonical splicing and the strand can often be inferred accurately for these junction spanning reads. Strand specific RNA-seq libraries have the advantage that they maintain the transcription strand information by ligating different RNA adapters on the 5’ and 3’ ends of each RNA molecule prior to cDNA synthesis. This increases the accuracy of alignment and allows us to independently measure transcription occurring on opposite strands at the same genomic position. Genome browsers capable of visualizing RNA-seq alignments (such as IGV) will often have a setting that allows reads to be colored according to the strand. 
Read aligners (such as TopHat (PMID: 19289445, 23618408)) and expression estimating tools (such as Cufflinks (PMID: 20436464) and HTSeq Count (PMID: 25260700)) also have parameters that need to be set to indicate the strandedness of the RNA-seq library (see Figure 6 for examples). |
14 | | Indexing and pooling of multiple RNA-seq libraries | ‘Indexing’ in the context of RNA-seq refers to the optional use of a short linker sequence, often a hexamer (or octamer), that is added to the cDNA fragments during library construction prior to sequencing. The index sequence is also known as a ‘barcode’. The index may be added to one or both ends of the cDNA fragment during RNA-seq library construction. Typically a unique index corresponds to each of several distinct RNA samples. Once indexed, RNA samples can be mixed, sequenced as a pool and separated during the analysis by a process known as demultiplexing. Accurate demultiplexing relies on exact or near exact matching of the observed index sequence to that expected for each library/sample. Occasional errors will result in some sequences that can not be demultiplexed and these reads are effectively lost to the analysis unless a custom pre-processing strategy is employed. Once data has been demultiplexed, the index sequence is removed and analysis normally proceeds as it would if no indexing was performed. However, in some cases where short fragments are sequenced and the length of the read exceeds the insert size, it may be possible for index sequences to wind up in the final read sequence.
A multiplexing strategy allows finer control over the amount of data produced for each RNA-seq library. For example, a single lane of RNA-seq data may be divided among 4 or more RNA-seq libraries. With current Illumina protocols, up to 96 samples can be indexed and pooled. The choice to index and pool prior to sequencing is generally driven by the desire to sequence several samples at a depth lower than what is available in a single lane of the instrument (the basic unit of data production). A good rule of thumb for RNA-seq analysis is that if you want only gene expression estimates (similar to what you would get from a microarray experiment) you will want at least ~30-50 million reads of data for each sample (PMID: 25271838). At current data production levels this means that 4-6 samples may be indexed and sequenced within a single lane of Illumina HiSeq 2000 (or equivalent). The sequencing depth, number, and type of replicates are critical to differential gene expression estimates and tools have been created to help design RNA-seq experiments (PMID: 25271838, 23314327). If analysis goals include more detailed analysis such as transcriptome assembly, alternative splicing analysis and single nucleotide variant profiling, the number of reads and replicate libraries is less well understood. Based on our own data, we recommend up to ~250 million reads per sample (possibly even more for robust profiling of lowly expressed transcripts).
Note: the ‘indexing’ described here that is used to allow concurrent sequencing of multiple samples in a single lane should not be confused with ‘molecular indexing’ where individual cDNA fragments are labeled to allow each molecule to be tracked from the original sample through sequencing (PMID: 24449890). |
15 |
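The demultiplexing step described in the table above (exact or near exact matching of an observed index to the expected index of each library) can be sketched as follows. This is a simplified illustration rather than the algorithm of any particular instrument software: the hexamer indexes, the sample names, the one-mismatch default, and the function names `hamming` and `assign_sample` are all assumptions made for the example.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(observed: str,
                  index_to_sample: Dict[str, str],
                  max_mismatches: int = 1) -> Optional[str]:
    """Return the sample whose index is uniquely closest to `observed`.

    Returns None when no index is within `max_mismatches`, or when two
    indexes are equally close; such reads cannot be demultiplexed.
    """
    best_sample: Optional[str] = None
    best_distance = max_mismatches + 1
    tie = False
    for index, sample in index_to_sample.items():
        distance = hamming(observed, index)
        if distance < best_distance:
            best_sample, best_distance, tie = sample, distance, False
        elif distance == best_distance and distance <= max_mismatches:
            tie = True
    return None if tie else best_sample


if __name__ == "__main__":
    # Hypothetical hexamer indexes for two pooled libraries.
    indexes = {"ATCACG": "tumor", "CGATGT": "normal"}
    print(assign_sample("ATCACG", indexes))  # exact match
    print(assign_sample("ATCACT", indexes))  # one sequencing error
    print(assign_sample("GGGGGG", indexes))  # matches no library
```

Allowing one mismatch recovers reads with a single sequencing error in the index, but this is only safe when every pair of indexes in the pool differs at several positions. Reads that return `None` correspond to the sequences the table describes as effectively lost to the analysis.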
16 |
17 |
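The pooling rule of thumb in the table above reduces to simple arithmetic: divide the expected read yield of one lane by the target depth per sample. The sketch below assumes a lane yield of 200 million reads purely as a placeholder (the table does not state a lane yield, so substitute the figure for your own instrument), and the function name `samples_per_lane` is hypothetical.

```python
def samples_per_lane(lane_reads: int, reads_per_sample: int) -> int:
    """Largest number of indexed libraries that fit in one lane at the
    requested depth, assuming reads are split evenly across the pool."""
    if lane_reads <= 0 or reads_per_sample <= 0:
        raise ValueError("read counts must be positive")
    return lane_reads // reads_per_sample


if __name__ == "__main__":
    lane_yield = 200_000_000  # assumed placeholder, not a figure from the table
    for depth in (30_000_000, 50_000_000, 250_000_000):
        print(depth, samples_per_lane(lane_yield, depth))
```

With this assumed yield, the 30-50 million read target for gene expression estimates gives 6 and 4 samples per lane respectively, in line with the 4-6 samples quoted in the table, while a 250 million read target does not fit in a single lane. Real pools are rarely perfectly balanced, so some headroom is advisable.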
18 |
19 |
--------------------------------------------------------------------------------
/manuscript/supplementary_tables/supplementary_table_4.md:
--------------------------------------------------------------------------------
1 | ###Supplementary Table 4. Description of RNA-seq library enrichment strategies
2 | A description of three RNA enrichment strategies is provided along with their anticipated effects on RNA-seq library construction and data interpretation. For a visual depiction of the concepts discussed here, refer to Figure 4.
3 |
4 | ####Enrichment strategy
5 |
6 | | Description | Total RNA | rRNA reduction | PolyA selection | cDNA capture |
7 | | ----------- | --------- | -------------- | --------------- | ------------ |
8 | | **General description** | Total RNA is isolated from cells or homogenized tissue. In many species, ribosomal RNA (rRNA) comprises as much as ~95-98% of all RNA molecules. For this reason, total RNA is rarely used for RNA-seq without first conducting an enrichment of some kind. | A strategy that attempts to capture rRNAs by hybridization to specific oligonucleotides. While rRNAs are immobilized, all other RNA molecules are washed through and used as input for RNA-seq library construction. The enrichment for RNAs of interest is indirect in this strategy. | A strategy that attempts to directly capture RNAs containing a polyA tail by hybridization to specific oligonucleotides. While polyadenylated RNAs are immobilized, all other RNA molecules are washed away. After washing, the polyadenylated RNAs are eluted and used as input for RNA-seq library construction. | A strategy that attempts to directly capture RNAs homologous to known exon sequences, for example by using an exome capture reagent. While this set of RNAs is immobilized, all other RNA molecules are washed away. After washing, the captured sequences are eluted and used as input for RNA-seq library construction. |
9 | | **Description of oligo/capture molecules** | No capture technique is used. | Sequences complementary to the rRNA transcripts of the species of interest. In human, rRNAs are divided into: 5S (120 bases in length), 5.8S (160 bases), 18S (~1.9 kb), and 28S (~5 kb) RNAs. Oligonucleotides in rRNA reduction kits attempt to target each of these rRNA sequences. | Oligonucleotides consisting of a series of thymine nucleotides complementary to the polyA tail of mature mRNA molecules. These sequences, often of 18-20 bases in length are referred to as oligo(dT)s. | cDNA capture sets can vary considerably, but consist of a broad selection of oligonucleotides with sequences complementary to known transcript sequences. In the case of an ‘exome’, designed oligos may cover a significant portion of all known exons for the species of interest. |
10 | | **Transcriptome representation** | Most broad. Close to an unbiased, complete representation. | Broad representation. | Focused transcriptome representation (polyA RNAs only). Varies. | Most focused transcriptome representation (targeted sequences only). |
11 | | **Effect on rRNAs** | Total RNA contains large amounts of rRNA along with all other classes of RNAs | Low rRNAs. rRNAs may be reduced by 60-90% or more. | Very low rRNAs. rRNAs may be reduced by 80-90% or more. | Very low rRNAs. rRNAs may be reduced by 80-90% or more. |
12 | | **Effect on abundant RNAs (other than rRNAs)** | While rRNAs dominate in total RNA, the remaining RNAs of other classes occur at widely different proportions, varying by at least 5-6 orders of magnitude from the least to most abundant transcript. | rRNA reduction should not affect the relative abundance of RNAs other than the rRNAs. | PolyA selection will result in an enrichment of polyA RNA molecules at the expense of rRNAs and all non-polyadenylated RNA (including many non-coding RNAs). However, within the polyA RNAs, relative abundance differences should remain unchanged. | cDNA capture will result in an enrichment of all RNA sequences targeted by the capture reagent at the expense of rRNA and all other RNAs that are not targeted. The difference between the most highly and lowly expressed transcripts that are targeted by the capture reagent may also be reduced. This ‘compression’ of dynamic range occurs because highly expressed genes may ‘saturate’ the corresponding capture probes. |
13 | | **Effect on rare RNAs** | Rare RNAs would be extremely difficult to observe by RNA-seq of total RNA because it works by random sampling of fragments. Rare RNA molecules would have a very low probability of being sequenced. Almost all sequenced fragments would align to rRNAs. | rRNA reduction focuses the total pool of RNA-seq reads onto all RNA classes other than rRNA. However, among the very diverse pool remaining, rare transcripts are still difficult to observe by random read sampling. | PolyA selection focuses RNA-seq reads onto an even narrower subset of the transcriptome than rRNA reduction. This improves the ability to detect rare polyA transcripts but the most highly expressed transcripts will still dominate. | cDNA capture is the most focused strategy described here and most reads will correspond to a target of interest. Furthermore the most abundant transcripts targeted by the capture are reduced, increasing the ability to sequence very rare transcripts. |
14 | | **Effect on genomic DNA (gDNA) contamination** | Unaffected. Any gDNA contamination remaining after RNA isolation (and likely DNase treatment) would be sequenced. | rRNA reduction is not expected to substantially affect the overall level of gDNA contamination. | Since gDNA sequences are not polyadenylated, their relative presence should for the most part be reduced following selection for polyA transcripts. However, regions of the genome with polyA stretches may also be inadvertently captured by oligo(dT) probes. | Overall, gDNA contamination will be reduced by cDNA capture except for gDNA fragments that substantially overlap the targeted sequences of the capture reagent. Signal from intergenic and intronic reads should be substantially reduced. |
15 | | **Effect on unprocessed RNA (also known as heterogeneous nuclear RNA, hnRNA)** | Unaffected. Any unprocessed RNA contamination remaining after RNA isolation would be sequenced. | rRNA reduction is not expected to substantially affect the overall level of unprocessed RNA contamination. | Unprocessed RNA should be significantly depleted. Since polyA tail addition occurs near the end of RNA transcription when the transcript emerges from an RNA polymerase complex, performing a polyA selection will tend to enrich for mature mRNAs that have been completely processed. | cDNA capture probes generally target the exons of known transcripts directly. For this reason, intronic RNA-seq reads corresponding to unprocessed RNA should be reduced. Signal from unprocessed RNA may still be considerable near the edges of targeted exons. |
16 |
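The rRNA figures in the table above can be combined into a rough back-of-the-envelope calculation. The sketch below assumes a simplified model in which depletion removes a fixed fraction of rRNA molecules and leaves all other RNAs untouched; the function name and the 99% depletion scenario are illustrative and not taken from the table.

```python
def rrna_fraction_after_depletion(rrna_fraction, depletion_efficiency):
    """Estimate the rRNA share of the library after depletion.

    rrna_fraction: starting share of rRNA molecules (e.g. 0.95).
    depletion_efficiency: share of rRNA molecules removed (e.g. 0.90).
    Simplifying assumption: non-rRNA molecules are not affected.
    """
    remaining_rrna = rrna_fraction * (1.0 - depletion_efficiency)
    other_rna = 1.0 - rrna_fraction
    return remaining_rrna / (remaining_rrna + other_rna)


# 95% rRNA input with 90% of the rRNA removed.
after_90 = rrna_fraction_after_depletion(0.95, 0.90)
# Same input with a hypothetical 99% removal.
after_99 = rrna_fraction_after_depletion(0.95, 0.99)
print(f"rRNA share after 90% depletion: {after_90:.1%}")
print(f"rRNA share after 99% depletion: {after_99:.1%}")
```

Under this model, removing 90% of the rRNA from a sample that starts at 95% rRNA still leaves roughly two thirds of the remaining molecules as rRNA, which is why the residual rRNA rate is worth checking during QC.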
--------------------------------------------------------------------------------
/manuscript/supplementary_tables/supplementary_table_5.md:
--------------------------------------------------------------------------------
1 | ###Supplementary Table 5. Strand related settings for RNA-seq tools that must be adjusted to account for library construction strategy
2 |
3 | The following table provides further explanation of IGV’s read orientation codes for RNA-seq data viewed in the browser. Also provided are recommended software settings for four additional tools involved in common RNA-seq analysis workflows: TopHat (PMID: 19289445, 23618408), HISAT2, HTSeq (PMID: 25260700), and Picard. Each of these explanations/settings is provided for several commonly used RNA-seq library construction kits that produce either stranded or unstranded data.
4 |
5 | | Library Kit | Stranded | 5p to 3p IGV | TopHat (--library-type parameter) | HISAT2 (--rna-strandness) | HTSeq (--stranded/-s) | Picard (STRAND_SPECIFICITY option of CollectRnaSeqMetrics) |
6 | | ----------- | -------- | ------------ | --------------------------------- | ------ | --------------------- | ---------------------------------------------------------- |
7 | | TruSeq Strand Specific Total RNA | Yes | F2R1 | fr-firststrand | R/RF | reverse | SECOND_READ_TRANSCRIPTION_STRAND |
8 | | NuGEN Encore | Yes | F1R2 | fr-secondstrand | F/FR | yes | FIRST_READ_TRANSCRIPTION_STRAND |
9 | | NuGEN OvationV2 | No | F2R1 or F1R2 | fr-unstranded | NONE | no | NONE |
10 |
11 | To identify which ‘--library-type’ setting to use with TopHat (PMID: 19289445, 23618408), Illumina specifically documents the types in the ‘RNA Sequencing Analysis with TopHat’ Booklet. For the TruSeq RNA Sample Prep Kit, the appropriate library type is 'fr-unstranded'. For TruSeq stranded sample prep kits, the library type is specified as 'fr-firststrand'. These posts are also very informative: How to tell which library type to use (fr-firststrand or fr-secondstrand)? and How to determine if a library Is strand-specific. Another suggestion is to view aligned reads in IGV and determine the read orientation by one of two methods. First, you can have IGV color alignments according to strand using the ‘Color alignments’ by ‘First-of-pair strand’ setting. Second, to get more detailed information you can hover your cursor over a read aligned to an exon. ‘F2 R1’ means the second read in the pair aligns to the forward strand and the first read in the pair aligns to the reverse strand. For a positive DNA strand transcript (5' to 3') this would denote a fr-firststrand setting in TopHat, i.e. "the right-most end of the fragment (in transcript coordinates) is the first sequenced". For a negative DNA strand transcript (3' to 5') this would denote a fr-secondstrand setting in TopHat. ‘F1 R2’ means the first read in the pair aligns to the forward strand and the second read in the pair aligns to the reverse strand. See above for the complete definitions, but it is simply the inverse for ‘F1 R2’ mapping. Anything other than FR orientation is not covered here and discussion with the individual responsible for library creation would be required. Typically ‘RF’ orientation is reserved for large-insert mate-pair libraries. Other orientations like ‘FF’ and ‘RR’ seem impossible with Illumina sequencing technology and suggest structural variation between the sample and reference. Additional details are provided in the TopHat manual.
12 |
13 | For HTSeq, the htseq-count manual indicates that for the ‘--stranded’ option, ‘stranded=no’ means that a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For ‘stranded=yes’ and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For ‘stranded=reverse’, these rules are reversed.
14 |
15 | For the ‘CollectRnaSeqMetrics’ sub-command of Picard, the Picard manual indicates that one should use ‘FIRST_READ_TRANSCRIPTION_STRAND’ if the reads are expected to be on the transcription strand.
16 |
17 |
18 |
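The rows of the table above can also be kept as a small lookup so that all four tools receive consistent strand settings. The sketch below is a convenience illustration only: the dictionary layout and function name are hypothetical, the values are copied from the table, and the orientation key is the one IGV reports for a read pair on a positive-strand (5p to 3p) transcript.

```python
# Strand settings keyed by the IGV orientation seen on a positive-strand gene.
# Values are copied from Supplementary Table 5 (paired-end case for HISAT2).
STRAND_SETTINGS = {
    "F2R1": {
        "tophat_library_type": "fr-firststrand",
        "hisat2_rna_strandness": "RF",
        "htseq_stranded": "reverse",
        "picard_strand_specificity": "SECOND_READ_TRANSCRIPTION_STRAND",
    },
    "F1R2": {
        "tophat_library_type": "fr-secondstrand",
        "hisat2_rna_strandness": "FR",
        "htseq_stranded": "yes",
        "picard_strand_specificity": "FIRST_READ_TRANSCRIPTION_STRAND",
    },
    "unstranded": {
        "tophat_library_type": "fr-unstranded",
        "hisat2_rna_strandness": None,  # leave the option unset
        "htseq_stranded": "no",
        "picard_strand_specificity": "NONE",
    },
}


def strand_settings(igv_orientation):
    """Return the per-tool strand settings for an observed IGV orientation."""
    try:
        return STRAND_SETTINGS[igv_orientation]
    except KeyError:
        raise ValueError(
            f"Orientation {igv_orientation!r} is not covered by the table"
        ) from None


truseq = strand_settings("F2R1")
print("--library-type", truseq["tophat_library_type"])
print("--stranded=" + truseq["htseq_stranded"])
```

Orientations outside the FR family raise an error on purpose, since the text above notes that they are not covered and should be discussed with whoever built the library.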
--------------------------------------------------------------------------------
/manuscript/supplementary_tables/supplementary_table_6.md:
--------------------------------------------------------------------------------
1 | ###Supplementary Table 6. Critical file formats used in RNA-seq analysis
2 |
3 | The following table describes several file formats used in most RNA-seq analysis workflows as well as several files specific to the expression analysis tools used by the online tutorials that accompany this article (at www.rnaseq.wiki).
4 |
5 | | File type | Description |
6 | | --------- | ----------- |
7 | | FASTA (PMID: 3162770) | http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml FASTA files are used to store sequences of DNA with a header that describes its source. It is the preferred format to represent the reference genome sequence needed by analysis algorithms like BLAST, BLAT, bwa, bowtie and TopHat. |
8 | | GTF | http://www.ensembl.org/info/website/upload/gff.html GTF (a constrained version of GFF), or gene transfer format, is a format that describes DNA, RNA, or protein sequences with their chromosome location and basic structural and functional annotations. |
9 | | FASTQ (PMID: 20015970) | http://maq.sourceforge.net/fastq.shtml FASTQ format is a next generation sequencing specific format for storing read sequence data. It includes a read quality score along with FASTA-like sequence information. This format is used to describe each RNA-seq read individually and is an accepted input to most sequence aligners. |
10 | | SAM/BAM (PMID: 19505943) | http://samtools.github.io/hts-specs/SAMv1.pdf SAM (Sequence Alignment Map) is a flexible sequence alignment format used to describe the alignment of sequence reads to a reference genome sequence. BAM is a binary, compressed version of the SAM file used for more efficient storage and access. |
11 | | BED | http://www.ensembl.org/info/website/upload/bed.html BED (Browser Extensible Data) is a file format that is used to store location-annotation genome coordinate pairs to be displayed in a genome browser. |
12 | | junctions.bed (PMID: 19289445, 23618408) | https://www.biostars.org/p/16653/ The junctions.bed file is produced by running TopHat on a set of read sequences. It contains exon-exon junction (and exon boundary) information and counts for all reads spanning two exons across an intron. |
13 | | Cufflinks output files (PMID: 20436464) | https://www.biostars.org/p/16574/ http://cole-trapnell-lab.github.io/cufflinks/cufflinks/index.html#cufflinks-output-files Cufflinks produces two main output files. A transcripts.gtf file, an annotated file of the transcript sequences/structures predicted by Cufflinks by examining RNA-seq read alignments. fpkm_tracking files, that are used to summarize expression values at both the gene and the transcript level. |
14 | | HTSeq output files (PMID: 25260700) | http://seqanswers.com/forums/showthread.php?t=4805 http://www-huber.embl.de/users/anders/HTSeq/doc/count.html HTSeq produces a simple tab delimited output file with raw read counts summarized to the level of specific genome features. Usually the count represents reads that overlap any of the exons of a gene and one value is reported for each gene. |
15 |
16 |
17 |
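To make the FASTQ description above concrete, the following minimal sketch reads four-line FASTQ records and decodes their quality strings. It assumes the common Phred+33 quality encoding and unwrapped (single-line) sequences; the read names and sequences are invented for illustration, and real pipelines would normally rely on an established parser.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality_scores) from a four-line FASTQ stream.

    Assumes Phred+33 encoding and one line each for sequence and quality.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, [ord(char) - 33 for char in quality]


# Two made-up records standing in for a FASTQ file.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n")
records = list(read_fastq(example))
for read_id, sequence, scores in records:
    print(read_id, sequence, scores)
```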
--------------------------------------------------------------------------------
/manuscript/supplementary_tables/supplementary_table_7.md:
--------------------------------------------------------------------------------
1 | ###Supplementary Table 7. Common RNA-seq questions and their answers.
2 |
3 | The following table summarizes a list of commonly asked questions relating to RNA-seq analysis with links to BioStar (PMID: 22046109) posts where these questions have been addressed by the community.
4 |
5 | | Question | BioStar posts with answer |
6 | | -------- | ------------------------- |
7 | | Should I include biological replicates in my RNA-seq experimental design? If so, how many? | http://www.biostars.org/p/1161/ http://www.biostars.org/p/68885/ Yes. RNA-seq can be used to quantify transcript levels from a sample. In order to perform useful statistics, one sample is not ideal. Replicates must be used to power these statistics. The RNA-seq method is an impressive advancement with many applications for studying RNA biology but it does not eliminate biological variability. If the input samples are heavily degraded or have very low input amounts it may also be advisable to include certain types of technical replicates (e.g. making multiple libraries from each sample). |
8 | | How do I assess the quality of an RNA-seq library? What tools are available? | https://www.biostars.org/p/103090/ Generic quality control programs for sequencing data include: samtools, FastQC, BAMstats and SAMstat. Picard’s CollectRnaSeqMetrics function is also very useful for RNA-seq QC. Many additional metrics for evaluating RNA-seq data quality have been developed (PMID: 25577376). |
9 | | What is 3’ end bias and how might it complicate interpretation of expression estimates? | http://seqanswers.com/forums/showthread.php?t=9839 https://www.biostars.org/p/102812/ It is difficult to produce a library with perfectly uniform coverage of RNA-seq reads across the entire length of transcripts. For example, base positions at the extreme ends of transcripts tend to be slightly underrepresented at both the 5’ and 3’ end because there are fewer cDNA fragments that can be generated from the ends that would cover these positions than in the center of a transcript (PMID: 21410973). The term 3’ end bias in the context of RNA-seq refers to an overrepresentation of read sequences derived from the 3’ end of transcripts. This bias towards sequencing the 3’ ends of transcripts can be introduced by certain library construction strategies. In particular if the starting RNA is degraded (or becomes degraded during sample preparation) and the sample is then subjected to polyA enrichment, this will introduce 3’ end bias (PMID: 23685885). If the level of RNA degradation is high, the resulting sequence can be almost entirely focused on the 200-400 bases at the 3’ end of each transcript. Tools such as Picard can produce visualizations and specific metrics to assess the degree of end bias in an RNA-seq data set and specific methods have been proposed to correct for positional bias in RNA-seq expression estimation (PMID: 21410973). |
10 | | What does ‘Fragments Per Kilobase Of Exon Per Million Fragments Mapped’ (FPKM) mean? | https://www.biostars.org/p/68126/ FPKM is an expression estimate that attempts to normalize for differences in library sequence depth between samples and differences in gene size between genes. FPKM is a similar metric to Reads Per Kilobase of transcript per Million (RPKM). However, FPKM values use the count of cDNA fragments, not reads. Various sequencing platforms can generate single or paired end reads, introducing ambiguity in the mapping from reads to fragments. FPKM values attempt to resolve this ambiguity by using the fragment of cDNA as the smallest unit. Cufflinks is an example of a tool that generates FPKM values for genes and transcripts/isoforms (PMID: 20436464). |
11 | | How are individual reads assigned to specific transcripts/isoforms when calculating FPKM? | https://www.biostars.org/p/16649/ The problem of assigning individual reads to specific isoforms or transcripts is a challenging one. Current popular solutions take many inferences into account in displaying isoform structures with read counts. Some of the ambiguity in this problem can be resolved by local differences between isoform structures that can be mapped uniquely, but caution should be taken before interpreting the FPKM values for specific isoforms with large, complex splicing patterns. |
12 | | How do I find novel splicing events/transcripts? What tools are available for alternative splicing detection from RNA-seq data? | https://www.biostars.org/p/68966/ https://www.biostars.org/p/65617/ This problem is still being actively addressed. Separating the problem into subtasks can be useful. Breaking up the alignment, assembly and transcript calling and quantification may lead to a cleaner solution, and many tools are available for these tasks at the links above. |
13 | | What is a duplicate read? | http://seqanswers.com/forums/showthread.php?t=6854 http://sourceforge.net/p/picard/wiki/Main_Page/ https://www.biostars.org/p/107402/ Duplicate reads are two or more reads that are assumed to be derived from the same nucleotide fragment and therefore do not represent independent transcriptome information from the sample being sequenced. Duplicate reads are identified by algorithms that examine position sorted BAM files. Typically, for paired-end read data (single-end data is also handled) these algorithms find the 5p coordinates and mapping orientations of each read pair while taking into account all clipping that has taken place as well as any gaps or jumps in the alignment. All read pairs sharing identical 5p coordinates and orientations are marked as duplicates except the "best" pair. Two commonly used tools for duplicate marking/removal are Picard ‘MarkDuplicates’ and samtools ‘rmdup’. Note: This question/answer refers to PCR duplicates, ‘optical duplicates’ are a distinct concept. |
14 | | Should I remove duplicates from RNA-seq libraries? | https://www.biostars.org/p/14283/ Generally no, but the decision to remove duplicates could be made on a case-by-case basis for your dataset. Unlike in DNA sequencing studies, duplicate reads in an RNA-seq sample are much more likely to be real identical fragments of small RNA transcripts with high expression. Removing these would bias the expression distribution of your sample and is not recommended (PMID: 25271838). However, if quantification of expressed transcripts is not the aim of the study, then removing duplicates can cut down on memory usage and computing time for other analyses. Duplicate read removal is a standard practice in WGS and exome sequencing pipelines and involves the identification and marking of read alignments that are deemed identical to each other. Duplicates are typically identified as those read pairs that share identical outer alignment coordinates for both reads of a pair (see previous answer for more details). These identically mapped reads are assumed to be artifacts of PCR amplification derived from the same DNA fragment because the probability of sequencing an identical fragment of DNA from genomic DNA by chance is low. While this assumption holds for DNA (from species with large genomes) it does not hold for RNA. There is a concern that duplicates may correspond to biased PCR amplification of particular fragments, however, for highly expressed or short genes, duplicates are expected even if there is no amplification bias. Removing them will reduce the dynamic range of expression estimates. Generally duplicates should therefore not be removed in RNA-seq analysis. However, in some situations (such as mutation calling) one might decide to remove them. |
15 | | Should I trim RNA-seq reads? What trimming tool should I use? | https://www.biostars.org/p/84305/ Read trimming may be advisable in certain circumstances depending on the results of QC analysis of the data. For example, if there is a considerable drop in base quality near the 3’ end of the reads, then quality trimming can be used to remove bases with an increased probability of containing errors. If too many errors are present at the ends of reads this may reduce overall alignment rates. If the RNA-seq library contains cDNA inserts that are shorter than the target read length, sequencing may run into the sequencing adapters used by the sequencing platform. These sequences may prevent reads from mapping to the reference genome. These reads can be fixed by adapter trimming with the known sequence of each sequencing adapter. Finally, if the RNA-seq library construction procedure involved an amplification step that required addition of an additional adapter sequence (e.g. T7 promoter or SPIA adapter) then additional adapter trimming may be advisable. Several read trimming tools are available for next generation sequence data including: Skewer (PMID: 24925680) and Trimmomatic (PMID: 24695404). |
16 | | How do I detect gene fusions in RNA-seq data? What tools are available? | https://www.biostars.org/p/45986/ Gene fusions are mostly analyzed in the context of cancer transcriptomes, where several prominent oncogenic fusion proteins are well described (e.g. BCR-ABL). Gene fusions are detected by identifying RNA-seq reads that indicate that portions of two genes (geneA-geneB) at physically separated genomic loci are expressed as a single unit. Since transcription normally occurs as a linear event in the 5’ to 3’ direction along a single continuous DNA molecule, such fusions observed at the RNA level may imply the presence of a structural variation (e.g. interchromosomal translocation) at the DNA level. RNA-seq reads that support a fusion are typically of two categories: spanning and encompassing. A spanning read is one where a single read sequence matches for part of its length to geneA and matches geneB for the remainder. The edges of these alignments to geneA and geneB often correspond to the edge of known exons. An encompassing read is one where read 1 of a read pair matches geneA and read 2 of the same read pair matches geneB. The details for many published fusion detection tools are available at the URL above. |
17 | | How do I visualize alternative splicing events in RNA-seq data? | https://www.biostars.org/p/8979/ Alternative splicing events are often visualized in genome browsers such as IGV (PMID: 22517427) by observing the splice junction spanning reads in the read alignment track, or by loading a ‘junctions.bed’ file that summarizes read counts supporting exon-exon junctions, or by using a genome browser plugin such as the ‘Sashimi plot’ module (PMID: 25617416) in IGV. Detailed protocols for visualizing alternative splicing in the genome browser IGB (PMID: 19654113) have also been developed (PMID: 24695404). Additional options are discussed in the biostars post linked above. |
18 | | How much RNA-seq data should I generate? How much total coverage do I need? | https://www.biostars.org/p/65501/ The question of how much coverage is necessary for an experiment is very difficult to answer and depends on experimental goals (PMID: 24434847). A common target used at our center is that at least 10,000 transcripts have at least 20x coverage over at least 50% of their known exon-exon junctions. This is usually obtained by a 200-300 million read run of 1-2 lanes of Illumina HiSeq data (40-100 Gb). The ENCODE consortium and other large scale sequencing initiatives have also published guidelines on this question (Supplementary Table 8). A more precise answer to this question depends on a number of factors, but the most important of these is the analytical question being asked of the data (PMID: 24434847). For example, the experiment may call for gene expression estimates, de novo transcriptome assembly, alternative expression analysis, or fusion detection. Published reports have argued that as few as 10 million reads are sufficient for gene expression estimation for each sample (PMID: 24319002). While there are clear statistical benefits to additional samples at the expense of deeper data on each sample (PMID: 21747377, 24020486), these estimates often assume that gene expression estimates are the only desired output of an RNA-seq experiment. Fusion detection, alternative expression analysis and other analysis strategies place higher demands on library depth for each sample. The optimal target sequence depth may also depend on the tissue type being profiled, method of RNA isolation, quality of input RNA, library construction method, and other experimental design factors (Supplementary Table 3). Furthermore, sequencing parameters such as read length or choice of paired versus unpaired read types influence read alignment efficiency and therefore may influence the total amount of read data needed.
Given the number of factors involved, there is no single right answer as to the amount of RNA-seq data needed. One strategy for setting this experimental design parameter is to base the decision on comparison to existing publications with similar goals. A more reliable approach is to determine analysis goals, identify metrics that measure the desired output (genes detected, exon-exon junctions resolved, etc.) and conduct a pilot experiment where a small subset of your libraries are sequenced deeply. The resulting data can then be analysed, saturation curves produced and the amount of data needed can be determined by a return on investment analysis (PMID: 23314327). |
19 | | Which aligners are optimized for RNA-seq and which should I use? | https://www.biostars.org/p/60478/ TopHat (PMID: 19289445, 23618408) is a popular choice for RNA-seq alignment. STAR is an alternative that produces similar alignments more quickly. If reads are being aligned against a reference genome sequence, the aligner used should be a gapped aligner that is aware of splicing patterns for the species being sequenced. If reads are being aligned directly to a database of transcript sequences, a faster aligner that is not splice aware may be used. Many alternatives to TopHat are available (PMID: 24185836), each with their own benefits and shortcomings. A large list of such aligners is maintained at the EBI HTS aligner list (RNA-seq aligners are indicated in red) (PMID: 23060614). The optimal alignment strategy depends on read length and the availability or choice of reference sequences that the reads are being aligned to. If read lengths are sufficiently long (>75 bp) and they are being aligned to a reference genome sequence, a gapped or ‘splice aware’ aligner such as TopHat (PMID: 19289445, 23618408), STAR (PMID: 22645380), MapSplice (PMID: 20802226), GSNAP (PMID: 20147302), HISAT or others should be used for a eukaryotic species where exon sequences may be separated by large introns that must be resolved during alignment. If read lengths are < 50 bp it may be advisable to use an ungapped aligner like BWA or Bowtie to align reads to a reference genome combined with an exon-exon junction database (PMID: 20835245). In this strategy, the junction database needs to be tailored to read length. In the absence of a reference genome sequence, RNA-seq reads can be aligned directly to a database of transcript sequences using an ungapped aligner. In the absence of a reference genome sequence or reference transcriptome database, de novo transcriptome assembly may be attempted with tools such as Trans-ABySS (PMID: 20935650) or Trinity (PMID: 21572440).
For some species such as human, the reference genome and transcriptome resources available are of high quality, having been created by extensive efforts involving gold standard sequencing and analysis techniques. Use of a reference genome and transcriptome to guide and inform the analysis is highly recommended where possible. De novo assembly and deconvolution of alternative isoforms are difficult problems compared to alignment of reads to a high quality reference genome sequence and comparison to a database of known transcripts (PMID: 25608678). De novo transcriptome assembly may be used to complement transcript discovery workflows that are guided by existing reference genome and transcriptome sequences. If these resources do not exist for a particular species, their creation should be considered a high priority. |
20 | | Is one alignment strategy sufficient for all downstream analysis needs? | Unfortunately, some tools for certain RNA-seq analysis applications have been carefully tuned to expect certain very specific alignment strategies. For example, one transcript abundance tool might expect alignments performed against a reference genome sequence while another might expect alignments performed against a database of transcript sequences. Fusion detection algorithms may rely on alignments that report many alternative alignments. Mutation calling tools might expect a BAM with duplicates marked while most other applications will not be affected by or require duplicate marking. Some RNA-seq aligners do not report small insertions or deletions very well and this will interfere with detecting variants of this type. Some aligners may not report alignments that span across two chromosomes, and this will also prevent detection of fusions. For these reasons and more one should consider carefully the alignment requirements of each analysis application and accept the reality that aligning the same data more than once by different methods might be a practical necessity in a comprehensive analysis pipeline. |
21 | | Should I allow multiple alignments for each read? | The answer to this question depends on the application. In DNA analysis it is common to use an alignment strategy that randomly selects one alignment from a series of equally good alignments. In RNA-seq analysis this is less common. When aligning RNA-seq reads against a transcript sequence database, multiple equally good alignments will be expected for genes with several isoforms that share common sequences. Some transcript abundance estimation tools (e.g. Cufflinks (PMID: 20436464)) specifically expect to use multiple mappings to a transcriptome or genome sequence in their estimations. Correctly representing the uncertainty of mapping for reads that correspond to multiple isoforms or regions of the genome has been found to increase the accuracy of transcriptome abundance estimation (PMID: 18516045, 20022975). In other words, allowing more multiple alignments is desirable in this context, though it will increase the size of RNA-seq BAM files. Similarly, in gene fusion discovery, allowing a larger number of alignments for each read can improve the ability of the fusion detector to correctly identify false positive fusions. One use case where one might choose to ignore multi-mapped reads is when performing mutation discovery with RNA-seq data. In this application, it might be best to align reads to the genome with an accurate gapped aligner and assign multi-mapped reads a mapping quality of 0 so that they can be easily ignored by variant callers interrogating the BAM file. |
22 | | Where do I obtain reference genome sequences (FASTA files) for my species of interest? | https://www.biostars.org/p/1796/ https://www.biostars.org/p/103359/ Reference genome sequences are generally obtained as a set of FASTA sequences representing the results of a genome sequencing and assembly initiative. The assembly consists of multiple contig sequences that each represent an entire chromosome or pieces of chromosomes depending on the degree of completion of the genome assembly. There will often be multiple versions of the genome assembly that represent ongoing improvements (e.g. hg17, hg18, hg19). Many species have a dedicated reference genome consortium and may operate an independent data portal where these sequences can be downloaded. Furthermore UCSC, Ensembl, and NCBI each act as centralized portals where reference genome sequences can be obtained for multiple species. Finally, the iGenomes project is hosted by Illumina and attempts to provide reference sequences that have been pre-indexed and organized for certain RNA-seq analysis workflows. |
23 | | Where can I obtain reference transcript sequences (GTF files) for my species of interest? | https://www.biostars.org/p/108359/ Transcriptome databases contain predicted and/or experimentally validated RNA transcript sequences that have been annotated against the reference genome sequence to resolve exon/intron boundaries. Additional functional annotations may also be available for each transcript sequence or gene locus. Transcript sequences are often made available as a FASTA file and annotations of those transcripts against the reference genome (including exon coordinates on the reference genome) will be provided as a GTF or GFF file (Supplementary Table 6). The same organizations described in the previous question that make the reference genome sequences available also make these transcriptome databases available for download. |
24 | | Where can I obtain publicly available RNA-seq datasets? | http://www.ncbi.nlm.nih.gov/geo/ https://www.biostars.org/p/46059/ https://www.biostars.org/p/52866/ http://seqanswers.com/forums/showthread.php?t=20469 The largest repository of publicly available RNA-seq datasets is probably the Gene Expression Omnibus hosted by NCBI NLM. Other sources are discussed in the links provided above. |
25 | | Where can I find a “gold standard RNA-seq data set” for differential expression analysis? | https://www.biostars.org/p/78229/ The experimental data reported in the ALEXA-seq publication is likely still the most in-depth validated data set publicly available (PMID: 20835245). The GEO accession for this data is GSE23776. This data contains ~200 differentially expressed exons validated by qPCR, and another ~200 alternative splicing structures validated by RT-PCR and Sanger sequencing. An additional data set compared various RNA-seq protocols to qPCR data for 40 genes (PMID: 24419370). |
26 | | How can I generate a custom isoform structure diagram (exon/intron boundaries)? | https://www.biostars.org/p/17841/ This is possible in R, Perl or Bioperl graphics utilities. Online tools exist as well. GenomeGraphs and the ExonIntron tool are two such applications. |
27 | | Where can I find a list of RNA-seq review papers? | https://www.biostars.org/p/52152/ In addition to the biostars link provided we have created a resources page that contains many useful papers and other RNA-seq references at: www.rnaseq.wiki. |
28 | | General discussion of RNA-seq analysis pipelines and best practices | https://www.biostars.org/p/6615/ In addition to the BioStar URL, we provide additional references relating to analysis pipelines and best practices in Supplementary Table 8. |
29 | | How do I integrate RNA-seq expression and gene regulation analyses? | https://www.biostars.org/p/11695/ While limited tools currently exist, there is great potential to combine whole genome or exome data generated by sequencing DNA with RNA-seq data generated by sequencing RNA from the same samples. This will allow an unprecedented ability to examine the sequence relationship between common polymorphisms and rare mutations in the DNA with expression levels and splicing patterns in the RNA. |
30 | | How do I obtain read counts for those reads that span across exon-exon junctions? | https://www.biostars.org/p/73832/ If alignments were produced by TopHat (PMID: 19289445, 23618408), the exon-exon junctions and read counts supporting each unique junction will be provided in a ‘junctions.bed’ file in the TopHat output directory. More generally, one could identify alignments in an RNA-seq BAM file whose CIGAR strings contain ‘N’ operators, which indicate regions skipped on the reference. A subset of these skipped regions will correspond to introns. These can be identified by examining the edges of the skipped region and using knowledge of splicing patterns in the sequenced species to determine whether it represents a likely intron splicing event. |
31 | | Why are there so many RNA-seq alignments within intronic regions? | https://www.biostars.org/p/42890/ RNA-seq alignments within intron regions can occur for various reasons (PMID: 25113896). First, while it is typical to perform DNase treatment of RNA samples prior to library construction, these treatments are not always complete and some intronic reads may represent genomic DNA that was not successfully removed or degraded. Second, RNA samples will typically contain a mixture of nuclear and cytoplasmic RNA. RNA from the nucleus may be incompletely processed heterogeneous nuclear RNA (hnRNA). hnRNA may contain introns that have not yet been spliced out. Third, random transcription events can happen anywhere, including within introns. Fourth, splicing errors or biologically significant alternative splicing may result in isoforms with retained introns. Fifth, the read may be misaligned to the intron. Sixth, if the RNA-seq library is unstranded, such reads might actually correspond to a gene being transcribed on the opposite strand that happens to reside within the intron of another gene. RNA-seq libraries that involve polyA selection will generally enrich for mature mRNA sequences that have been completely processed. This will lead to reduced noise levels within the introns. Another strategy to reduce intron reads might be to perform RNA isolation in a way that enriches for the cytoplasmic compartment or that selects for RNAs being actively translated by a ribosomal complex. Unfortunately, these strategies tend to lead to RNA degradation compared to conventional RNA isolation procedures. |
32 |
33 |
34 |
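The CIGAR-based approach to junction counting described in the table above can be sketched in R. This is a minimal illustration rather than a definitive implementation: it assumes the alignment positions (1-based leftmost coordinate) and CIGAR strings have already been extracted from a BAM file into a data frame, and the function name `junctions_from_cigar` and the `toy` data are hypothetical. It reports every region skipped by an 'N' operator and does not check splice-site motifs, so some of the reported regions may not be true introns.

```r
# Hypothetical helper: return the reference intervals skipped by 'N' operators
# for one alignment, given its 1-based leftmost position and its CIGAR string.
junctions_from_cigar <- function(pos, cigar) {
  ops  <- regmatches(cigar, gregexpr("[0-9]+[MIDNSHP=X]", cigar))[[1]]
  lens <- as.integer(sub("[MIDNSHP=X]$", "", ops))
  type <- sub("^[0-9]+", "", ops)
  ref_pos <- pos
  starts <- integer(0)
  ends   <- integer(0)
  for (i in seq_along(ops)) {
    if (type[i] == "N") {
      starts <- c(starts, ref_pos)
      ends   <- c(ends, ref_pos + lens[i] - 1L)
    }
    # Only these operators consume bases on the reference
    if (type[i] %in% c("M", "D", "N", "=", "X")) {
      ref_pos <- ref_pos + lens[i]
    }
  }
  data.frame(start = starts, end = ends)
}

# Made-up alignments: two reads span the same skipped region, one is unspliced
toy <- data.frame(
  chr   = c("chr22", "chr22", "chr22"),
  pos   = c(100L, 120L, 500L),
  cigar = c("50M200N50M", "30M200N70M", "100M"),
  stringsAsFactors = FALSE
)

# Collect the skipped regions of every read, one row per read/region pair
juncs <- do.call(rbind, lapply(seq_len(nrow(toy)), function(i) {
  j <- junctions_from_cigar(toy$pos[i], toy$cigar[i])
  data.frame(chr = rep(toy$chr[i], nrow(j)), j,
             reads = rep(1L, nrow(j)), stringsAsFactors = FALSE)
}))

# Count the reads supporting each unique skipped region
junction_counts <- aggregate(reads ~ chr + start + end, data = juncs, FUN = sum)
print(junction_counts)
```

Candidate junctions from such a table could then be filtered by comparing their edges to known exon boundaries or expected splice-site sequences, as the answer above suggests.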
--------------------------------------------------------------------------------
/manuscript/supplementary_tables/supplementary_table_8.md:
--------------------------------------------------------------------------------
1 | ###Supplementary Table 8. General resources for RNA-seq analysis
2 |
3 | The following table provides a list of general resources to help understand the background of RNA biology, next generation sequencing, RNA-seq laboratory methods, and RNA-seq analysis. Additional educational resources can be found in the resources section of the online tutorial at www.rnaseq.wiki.
4 |
5 | | Resource name and description |
6 | | ----------------------------- |
7 | | SeqAnswers. An online forum for next generation sequencing. |
8 | | BioStars. An online forum for bioinformatics (PMID: 22046109). |
9 | | Illumina videos on the basics of NGS sequencing: video 1, video 2. |
10 | | Molecular Biology of the Cell. From DNA to RNA. A comprehensive introduction to transcription, strandedness, RNA types, gene regulation, RNA polymerase function, splicing, and so on. |
11 | | The RNA-seqlopedia. An overview of RNA-seq and the choices necessary to carry out a successful RNA-seq experiment. |
12 | | RNA-seq Data: Challenges in and Recommendations for Experimental Design and Analysis. |
13 | | RNA Bioinformatics, a 25 chapter book covering many topics relevant to RNA-seq analysis. |
14 | | The RNA-Seq blog. An actively maintained news feed of RNA-seq related developments. |
15 | | HTS Mappers. An actively maintained list of short read aligners (RNA-seq aligners are indicated in red). |
16 | | The periodic table of bioinformatics. A list of commonly used bioinformatics tools. |
17 | | ENCODE standards, guidelines and best practices for RNA-seq. |
18 | | REMC standards and guidelines for RNA-sequencing. |
19 | | A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. |
20 | | List of RNA-seq bioinformatics tools. |
21 | | GEO: The Gene Expression Omnibus (contains many publicly available RNA-seq data sets). |
22 |
23 |
24 |
--------------------------------------------------------------------------------
/manuscript/supplementary_tables/supplementary_table_9.md:
--------------------------------------------------------------------------------
1 | ###Supplementary Table 9. RNA-seq workshops and online tutorials
2 |
3 | The following table lists RNA-seq workshops and other tutorials complementary to this article. These examples are limited to online materials or short workshops. Not listed here are formal training programs or degrees in bioinformatics. For ongoing discussion of this topic, refer to these BioStar (PMID: 22046109) posts: https://www.biostars.org/p/79845/ and https://www.biostars.org/p/11034/.
4 |
5 | | Workshop/Tutorial | URL |
6 | | ----------------- | --- |
7 | | Canadian Bioinformatics Workshops (CBW). Informatics for RNA-sequence analysis. Various additional bioinformatics courses. A live delivery of the tutorials accompanying this publication. | http://bioinformatics.ca/ |
8 | | Cold Spring Harbor Laboratory (CSHL). Advanced Sequencing Technologies & Applications. | http://meetings.cshl.edu/courses/2014/c-seqtec14.shtml |
9 | | Biostars. Discussion of ongoing and upcoming training opportunities. | https://www.biostars.org/p/11034/ https://www.biostars.org/p/79845/ |
10 | | Michigan State University. Analyzing Next-Generation Sequencing Data. | http://bioinformatics.msu.edu/ngs-summer-course-2015 |
11 | | EBI. Advanced RNA-Seq and ChIP-Seq Data Analysis. | http://www.ebi.ac.uk/training/ |
12 | | UC Davis Bioinformatics Training Program. RNA-Seq Workshop: From Pipette to P-value! Bootcamp: Introduction to RNA-seq. | http://training.bioinformatics.ucdavis.edu/ |
13 | | EMBL. Various bioinformatics courses. | http://www.embl.de/training/events/index.php |
14 | | Wellcome Trust. Various bioinformatics courses. | http://www.wellcome.ac.uk/Education-resources/Courses-and-conferences/ |
15 | | ECSEQ Bioinformatics. RNA-seq Bioinformatics: A Practical Introduction. | http://www.ecseq.com/training.html |
16 | | Bioinformatics.org. Various bioinformatics courses. | http://www.bioinformatics.org/wiki/Educational_services |
17 | | Data carpentry. Various data science relevant topics. | http://datacarpentry.org/ |
18 | | Software carpentry. Various software and analysis relevant topics. | http://software-carpentry.org/ |
19 | | HarvardX. Case study: RNA-seq data analysis | https://www.edx.org/course/case-study-rna-seq-data-analysis-harvardx-ph525-5x |
20 | | Princeton RNA-seq workshop | http://www.princeton.edu/genomics/sequencing/instructions/rna-seq-workshop/RNA-seq-introduction-v2.pdf |
21 | | NCBI NOW (Next generation sequencing Online Workshop) | http://www.ncbi.nlm.nih.gov/news/09-30-2015-ncbi-now-next-gen-seq-course/ |
22 | | NCBI comparison of RNA-seq aligners tutorial | https://github.com/NCBI-Hackathons/RNA_mapping/wiki |
23 |
--------------------------------------------------------------------------------
/scripts/Igv_HCC1143_attributes.txt:
--------------------------------------------------------------------------------
1 | NAME
2 | HCC1143.normal.21.19M-20M.bam
3 |
--------------------------------------------------------------------------------
/scripts/Run_batch_IGV_snapshots.txt:
--------------------------------------------------------------------------------
1 | #new
2 | #setSleepInterval 200
3 |
4 | # change this path to point to your files:
5 | #load /Users/asm_work/HT-Seq/igv/data/igv_HCC1143_attributes.txt,/Users/asm_work/HT-Seq/igv/data/HCC1143.normal.21.19M-20M.bam
6 |
7 | snapshotDirectory /Users/asm_work/HT-Seq/igv/screenshots/
8 |
9 |
10 | # Example1:
11 | goto chr21:19479237-19479814
12 | sort strand
13 | maxPanelHeight 50000
14 | snapshot HCC1143_exercise1_SNV_SNP_pair.png
15 |
16 | # Example2:
17 | goto chr21:19,518,412-19,518,497
18 | sort strand
19 | snapshot HCC1143_exercise2_homopolymerRepeat.png
20 |
21 | # Example3:
22 | goto chr21:19,611,925-19,631,555
23 | collapse
24 | snapshot HCC1143_exercise3_lowGC.png
25 |
26 | # Example4:
27 | goto chr21:19,666,833-19,667,007
28 | expand
29 | sort base
30 | snapshot HCC1143_exercise4_hetSNP_allelic.png
31 |
32 | # Example5:
33 | goto chr21:19,800,320-19,818,162
34 | collapse
35 | snapshot HCC1143_exercise5_LINErepeat.png
36 |
37 | # Example6:
38 | goto chr21:19,324,469-19,331,468
39 | expand
40 | viewaspairs
41 | snapshot HCC1143_exercise6_homDEL.png
42 |
43 | # Example7:
44 | goto chr21:19,102,154-19,103,108
45 | snapshot HCC1143_exercise7_AluY.png
46 |
47 | # Example8:
48 | goto chr21:19,089,694-19,095,362
49 | snapshot HCC1143_exercise8_translocation.png
50 |
51 | # exit
52 |
--------------------------------------------------------------------------------
/scripts/Tutorial_Module4_ERCC_DE.R:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env Rscript
2 |
3 | #./Tutorial_Module4_ERCC_DE.R /gscmnt/gc2801/analytics/jwalker/RNAseqTutorial/refs/ERCC/ERCC_Controls_Analysis.txt /gscmnt/gc2801/analytics/jwalker/RNAseqTutorial/de/tophat_cufflinks/ref_only/gene_exp.diff
4 |
5 | library(ggplot2)
6 |
7 | args <- commandArgs(TRUE)
8 | erccFile <- args[1]
9 | diffFile <- args[2]
10 |
11 | erccData = read.delim(erccFile)
12 | sortErccData = order(erccData[,'ERCC.ID'])
13 | erccData = erccData[sortErccData,]
14 |
15 | diffData = read.delim(diffFile)
16 | diffIdx = which(diffData[,'gene'] %in% erccData[,'ERCC.ID'])
17 | diffData = diffData[diffIdx,]
18 | sortDiffData = order(diffData[,'gene'])
19 | diffData = diffData[sortDiffData,]
20 |
21 | diffData[,'observed_log2_fc'] = diffData[,'log2.fold_change.'] * -1
22 | diffData[,'expected_log2_fc'] = erccData[match(diffData[,'gene'], erccData[,'ERCC.ID']),'log2.Mix.1.Mix.2.']
23 | diffData[,'subgroup'] = erccData[match(diffData[,'gene'], erccData[,'ERCC.ID']),'subgroup']
24 |
25 | okDiffDataIdx = which(diffData[,'status']=='OK')
26 | diffData = diffData[okDiffDataIdx,]
27 |
28 | model <- lm(observed_log2_fc ~ expected_log2_fc, data=diffData)
29 | r_squared = summary(model)[['r.squared']]
30 |
31 | pdf('Tutorial_Module4_ERCC_DE.pdf')
32 | ggplot(diffData, aes(x=expected_log2_fc, y=observed_log2_fc)
33 | ) + geom_point(aes(color=subgroup)
34 | ) + geom_smooth(method=lm
35 | ) + annotate('text', 1, 2,
36 | label=paste("R^2 =", r_squared, sep=' '))
37 | dev.off()
38 |
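# --- Illustrative sketch (not part of the original script) ---
# The script above pairs each observed log2 fold change with the expected value
# for the same ERCC ID. The toy example below shows one way to make that pairing
# explicit with match(), so that it still holds when some ERCC IDs are missing
# from the differential expression table or the two tables are ordered differently.
# All 'toy_*' names and values are made up for illustration only.
toy_ercc <- data.frame(ERCC.ID = c("ERCC-00002", "ERCC-00003", "ERCC-00004"),
                       expected = c(2, 0, -1), stringsAsFactors = FALSE)
toy_diff <- data.frame(gene = c("ERCC-00004", "ERCC-00002"),
                       observed = c(-0.9, 1.8), stringsAsFactors = FALSE)
# For each row of toy_diff, find the row of toy_ercc with the same ID
toy_idx <- match(toy_diff$gene, toy_ercc$ERCC.ID)
toy_diff$expected <- toy_ercc$expected[toy_idx]
print(toy_diff)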
--------------------------------------------------------------------------------
/scripts/Tutorial_Module4_ERCC_DE.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/scripts/Tutorial_Module4_ERCC_DE.pdf
--------------------------------------------------------------------------------
/scripts/Tutorial_Module4_ERCC_expression.R:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env Rscript
2 |
3 | library(ggplot2)
4 |
5 | args <- commandArgs(TRUE)
6 | filename <- args[1]
7 |
8 | data = read.delim(filename)
9 |
10 | data$logCount = log2(data$Count + 1)
11 | data$logConc= log2(data$Concentration)
12 |
13 | count_model <- lm(logCount ~ logConc, data=data)
14 | count_r_squared = summary(count_model)[['r.squared']]
15 |
16 | pdf('Tutorial_Module4_ERCC_expression.pdf')
17 | ggplot(data, aes(x=logConc, y=logCount)
18 | ) + geom_point(aes(shape=Label)
19 | ) + geom_smooth(method=lm
20 | ) + annotate('text', 5, -3,
21 | label=paste("R^2 =", count_r_squared, sep=' '))
22 | dev.off()
23 |
24 |
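# --- Illustrative sketch (not part of the original script) ---
# For a linear regression with a single predictor and an intercept, the R^2
# reported by summary(lm(...)) equals the squared Pearson correlation between
# the two variables. The toy values below are made up purely to demonstrate
# this on the same log2 transforms used in the script above.
toy <- data.frame(Concentration = c(1, 4, 16, 64, 256),
                  Count = c(0, 9, 30, 150, 520))
toy$logCount <- log2(toy$Count + 1)   # +1 avoids log2(0) for zero counts
toy$logConc <- log2(toy$Concentration)
toy_model <- lm(logCount ~ logConc, data = toy)
toy_r_squared <- summary(toy_model)[['r.squared']]
toy_cor_squared <- cor(toy$logCount, toy$logConc)^2
# all.equal() compares the two values within numerical tolerance
print(all.equal(toy_r_squared, toy_cor_squared))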
--------------------------------------------------------------------------------
/scripts/Tutorial_Module4_ERCC_expression.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/scripts/Tutorial_Module4_ERCC_expression.pdf
--------------------------------------------------------------------------------
/scripts/Tutorial_Module4_ERCC_expression.pl:
--------------------------------------------------------------------------------
1 | #!/usr/bin/perl
2 |
3 | use strict;
4 | use warnings;
5 |
6 | use IO::File;
7 |
8 | my $data_dir = $ENV{RNA_HOME} .'/expression/tophat_counts';
9 | my $ercc_file = $data_dir .'/ERCC_Controls_Analysis.txt';
10 | my $counts_file = $data_dir .'/gene_read_counts_table_all.tsv';
11 | my $ercc_counts_file = $data_dir .'/ercc_read_counts.tsv';
12 |
13 | my $ercc_fh = IO::File->new($ercc_file,'r');
14 | unless ($ercc_fh) { die('Failed to find file: '. $ercc_file) }
15 |
16 | my %ercc_data;
17 | while (my $ercc_line = $ercc_fh->getline) {
18 | chomp($ercc_line);
19 | if ($ercc_line =~ /^Re/) { next; }
20 | #my ($resort,$id,$subgroup,$mix1,$mix2,$fold_change,$log2)
21 | my @ercc_entry = split("\t",$ercc_line);
22 | $ercc_data{$ercc_entry[1]} = \@ercc_entry;
23 | }
24 |
25 | my @labels = qw/UHR_Rep1 UHR_Rep2 UHR_Rep3 HBR_Rep1 HBR_Rep2 HBR_Rep3/;
26 |
27 | my $counts_fh = IO::File->new($counts_file,'r');
28 | unless ($counts_fh) { die('Failed to find file: '. $counts_file); }
29 |
30 | my $ercc_counts_fh = IO::File->new($ercc_counts_file,'w');
31 | unless ($ercc_counts_fh) { die('Failed to open file: '. $ercc_counts_file); }
32 |
33 | my %count_data;
34 | print $ercc_counts_fh "ID\tSubgroup\tLabel\tMix\tConcentration\tCount\n";
35 | while (my $counts_line = $counts_fh->getline) {
36 | chomp($counts_line);
37 | my @count_entry = split(' ',$counts_line);
38 | if ($ercc_data{$count_entry[0]}) {
39 | my $id = $count_entry[0];
40 | my $subgroup = $ercc_data{$id}->[2];
41 | for (my $i = 0; $i < scalar(@labels); $i++) {
42 | my $count = $count_entry[$i+1];
43 | my $label = $labels[$i];
44 | my $conc;
45 | my $mix;
46 | if ($label =~ /UHR/) {
47 | $mix = 1;
48 | $conc = $ercc_data{$id}->[3];
49 | } else {
50 | $mix = 2;
51 | $conc = $ercc_data{$id}->[4];
52 | }
53 | print $ercc_counts_fh $id ."\t". $subgroup ."\t". $label ."\t". $mix ."\t". $conc ."\t". $count ."\n";
54 | }
55 | }
56 | }
57 |
58 |
59 | exit;
60 |
--------------------------------------------------------------------------------
/scripts/Tutorial_Module4_Part2_cummeRbund.R:
--------------------------------------------------------------------------------
1 | #Tutorial_Module4_Part2_cummeRbund.R
2 |
3 | #Malachi Griffith, mgriffit[AT]genome.wustl.edu
4 | #Obi Griffith, ogriffit[AT]genome.wustl.edu
5 | #The Genome Institute, Washington University School of Medicine
6 | #R tutorial for CBW - Informatics for RNA-sequence Analysis
7 |
8 |
9 | # Optional:
10 | # Install cummeRbund library - this should have been done already
11 | #source("http://bioconductor.org/biocLite.R")
12 | #biocLite("cummeRbund")
13 |
14 | # Load cummeRbund library
15 | library(cummeRbund)
16 |
17 | #A recent overhaul of RSQLite (update to version 1.0.0 on October 25th, 2014) broke a number of cummeRbund functions
18 | #This version of the package no longer contains the function used by cummeRbund called "sqliteQuickSQL" and others
19 | #The authors of cummeRbund are working on fixes
20 | #See http://seqanswers.com/forums/showthread.php?t=47785&page=2
21 | #The following is a temporary workaround
22 | sqliteQuickSQL<-dbGetQuery
23 | dbBeginTransaction<-dbBegin
24 |
25 | # Set the paths to cuffdiff/cuffmerge output
26 | # Change these paths if you wish to produce cummeRbund output for different cuffdiff runs (e.g., using STAR alignments)
27 | refCuffdiff="~/workspace/rnaseq/de/tophat_cufflinks/ref_only"
28 | gtfFilePath="~/workspace/rnaseq/expression/tophat_cufflinks/ref_only/merged/merged.gtf"
29 | genomePath="~/workspace/rnaseq/refs/hg19/fasta/chr22_ERCC92/chr22_ERCC92.fa"
30 | outfile="~/workspace/rnaseq/de/tophat_cufflinks/ref_only/Tutorial_Part2_cummeRbund_output.pdf"
31 | outfile2="~/workspace/rnaseq/de/tophat_cufflinks/ref_only/Tutorial_Part2_cummeRbund_output_extras.pdf"
32 |
33 | # read in Cufflinks output
34 | cuff <- readCufflinks(dir=refCuffdiff,rebuild=T,gtfFile=gtfFilePath,genome=genomePath)
35 |
36 | # show the data structures contained in cuff object
37 | cuff
38 |
39 | #Set pdf device
40 | pdf(file=outfile)
41 |
42 | # Plot #1 - A density plot of FPKM across sample replicates
43 | densRep<-csDensity(genes(cuff),replicates=T)
44 | densRep
45 |
46 | # Plot #2 - A box plot of FPKM across sample replicates
47 | brep<-csBoxplot(genes(cuff),replicates=T)
48 | brep
49 |
50 | # Plot #3 - A single scatter comparing UHR vs HBR
51 | sampleScatter<-csScatter(genes(cuff),"UHR","HBR",smooth=T)
52 | sampleScatter
53 |
54 | # Plot #4 - An MAplot of UHR vs HBR
55 | m<-MAplot(genes(cuff),"UHR","HBR")
56 | m
57 |
58 | # Plot #5 - A volcano plot of p-value and fold_change per sample
59 | v<-csVolcano(genes(cuff),"UHR","HBR",alpha=0.05)
60 | v
61 |
62 |
63 | # Plot #6 - A dendrogram of the distance between sample replicates (based on JS distance)
64 | dend.rep<-csDendro(genes(cuff),replicates=T)
65 |
66 | # Plot #7 - A heatmap of sample replicate distance based on JS distance
67 | myRepDistHeat<-csDistHeat(genes(cuff),replicates=T)
68 | myRepDistHeat
69 |
70 | # Plot #8 - Principal Component Analysis of all genes across each sample
71 | genes.PCA<-PCAplot(genes(cuff),"PC1","PC2")
72 | genes.PCA
73 |
74 | # Plot #9 - MDS scaling of all genes across sample replicates
75 | genes.MDS.rep<-MDSplot(genes(cuff),replicates=T)
76 | genes.MDS.rep
77 |
78 | # Get the gene ids of the significant (FDR <5%) differentially expressed genes
79 | mySigGeneIds<-getSig(cuff,alpha=0.05,level='genes')
80 |
81 | # The ids of the first n genes
82 | head(mySigGeneIds)
83 |
84 | # The total number of significant differentially expressed genes
85 | length(mySigGeneIds)
86 |
87 | # Grab a geneSet including only those genes with an FDR <5%
88 | mySigGenes<-getGenes(cuff,mySigGeneIds)
89 |
90 | # Summarize the geneSet data structure
91 | mySigGenes
92 |
93 | # Plot #10 - A heatmap of significant differentially expressed genes
94 | sigHeat<-csHeatmap(mySigGenes,cluster='both',labRow=F)
95 | sigHeat
96 |
97 | # Grab a single gene of interest
98 | myGeneId<-"TST"
99 | myGene<-getGene(cuff,myGeneId)
100 | myGene
101 |
102 | #Summarize the gene and isoform level FPKM values for this gene
103 | head(fpkm(myGene))
104 | head(fpkm(isoforms(myGene)))
105 |
106 | # Plot #11 - gene-level expression levels for UHR vs HBR
107 | gl.rep<-expressionPlot(myGene,replicates=TRUE)
108 | gl.rep
109 |
110 | # Plot #12 - isoform-level expression levels for UHR vs HBR
111 | gl.iso.rep<-expressionPlot(isoforms(myGene),replicates=T)
112 | gl.iso.rep
113 |
114 | #Summarize individual features for this gene
115 | head(features(myGene))
116 |
117 | # Plot #13 - Create a visual representation of the isoforms of a gene
118 | genetrack<-makeGeneRegionTrack(myGene)
119 | plotTracks(genetrack)
120 |
121 | # Plot #14 - Add an ideogram for relevant chromosome and the gene's position on chromosome
122 | #Plot cufflinks features for the gene plus known isoforms for gene region with 2kb flanking region
123 | ###NOTE several of the track plotting functions below are currently broken.###
124 | trackList<-list()
125 | myStart<-min(features(myGene)$start)
126 | myEnd<-max(features(myGene)$end)
127 | myChr<-unique(features(myGene)$seqnames)
128 | genome<-'hg19'
129 | #ideoTrack<-IdeogramTrack(genome=genome,chromosome=myChr)
130 | #trackList<-c(trackList,ideoTrack)
131 | axtrack<-GenomeAxisTrack()
132 | trackList<-c(trackList,axtrack)
133 | genetrack<-makeGeneRegionTrack(myGene)
134 | genetrack
135 | trackList<-c(trackList,genetrack)
136 | #biomTrack<-BiomartGeneRegionTrack(genome=genome,chromosome=as.character(myChr),start=myStart,end=myEnd,name="ENSEMBL",showId=T)
137 | #trackList<-c(trackList,biomTrack)
138 |
139 | #Add conservation levels
140 | #conservation<-UcscTrack(genome="hg19",chromosome=myChr,track="Conservation",table="phyloP100wayAll",from=myStart-2000,to=myEnd+2000,trackType="DataTrack",start="start",end="end",data="score",type="hist",window="auto",col.histogram="darkblue",fill.histogram="darkblue",ylim=c(-3.7,4),name="Conservation")
141 | #trackList<-c(trackList,conservation)
142 | plotTracks(trackList,from=myStart-2000,to=myEnd+2000)
143 |
144 | #Close pdf device - necessary before you can open it in your browser
145 | dev.off()
146 |
147 | #The output file can be viewed in your browser at the following url:
148 | #Note, you must replace __YOUR_IP_ADDRESS__ with the IP address of your own Amazon instance (e.g. 101.0.1.101)
149 | #http://__YOUR_IP_ADDRESS__/workspace/rnaseq/de/tophat_cufflinks/ref_only/Tutorial_Part2_cummeRbund_output.pdf
150 |
151 | #Additional plot examples to try on your own:
152 |
153 | #Set pdf device
154 | pdf(file=outfile2)
155 |
156 | # generate dispersion plot (observed vs theoretical variance)
157 | disp<-dispersionPlot(genes(cuff))
158 | disp
159 |
160 | # A count based MAplot of UHR vs HBR (use counts instead of FPKM)
161 | mCount<-MAplot(genes(cuff),"UHR","HBR",useCount=T)
162 | mCount
163 |
164 | # A volcano matrix per sample
165 | vMatrix<-csVolcanoMatrix(genes(cuff))
166 | vMatrix
167 |
168 | # A distribution of significant features per condition
169 | mySigMat<-sigMatrix(cuff,level='genes',alpha=0.05)
170 | mySigMat
171 |
172 | # PCA of all genes across each sample replicate
173 | genes.PCA.rep<-PCAplot(genes(cuff),"PC1","PC2",replicates=T)
174 | genes.PCA.rep
175 |
176 | #Plot CDS-level expression levels for UHR vs HBR
177 | gl.cds.rep<-expressionPlot(CDS(myGene),replicates=T)
178 | gl.cds.rep
179 |
180 | #Many more... see cummeRbund documentation
181 |
182 | #Close pdf device - necessary before you can open it in your browser
183 | dev.off()
184 |
185 | #The output file can be viewed in your browser at the following url:
186 | #Note, you must replace __YOUR_IP_ADDRESS__ with the IP address of your own Amazon instance
187 | #http://__YOUR_IP_ADDRESS__/workspace/rnaseq/de/tophat_cufflinks/ref_only/Tutorial_Part2_cummeRbund_output_extras.pdf
188 |
189 | #To exit R type:
190 | quit(save="no")
191 |
192 | #The following plots no longer work in the current cummeRbund. They could possibly be reworked:
193 |
194 | #Barplot of gene-level expression levels for UHR vs HBR
195 | #gb<-expressionBarplot(myGene)
196 | #gb
197 |
198 | #Barplot of gene-level expression levels for UHR vs HBR, by replicate
199 | #gb.rep<-expressionBarplot(myGene,replicates=T)
200 | #gb.rep
201 |
202 | #Barplot of isoform-level expression levels for UHR vs HBR, by replicate
203 | #igb<-expressionBarplot(isoforms(myGene),replicates=T)
204 | #igb
205 |
206 |
--------------------------------------------------------------------------------
/scripts/Tutorial_Module4_Part2_cummeRbund_output.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/scripts/Tutorial_Module4_Part2_cummeRbund_output.pdf
--------------------------------------------------------------------------------
/scripts/Tutorial_Module4_Part3_Supplementary_R.R:
--------------------------------------------------------------------------------
1 | #Tutorial_Part3_Supplementary_R.R
2 |
3 | #Malachi Griffith, mgriffit[AT]genome.wustl.edu
4 | #Obi Griffith, ogriffit[AT]genome.wustl.edu
5 | #The Genome Institute, Washington University School of Medicine
6 | #R tutorial for CBW - Bioinformatics for Cancer Genomics - RNA Sequence Analysis
7 |
8 | #Starting from the output of the RNA-seq Tutorial Part 1.
9 |
10 | #Install packages and load libraries
11 | #install.packages("ggplot2")
12 | library(ggplot2)
13 | library(gplots)
14 |
15 | #If X11 not available, open a pdf device for output of all plots
16 | pdf(file="Tutorial_Part3_Supplementary_R_output.pdf")
17 |
18 | #### Basic R usage.
19 |
20 | #Lines beginning with "#" are comments. All other lines should be executed *IN ORDER*
21 | #Copy and paste from this document to the R commandline interface
22 | #OR if running R on a Mac use: Command+Enter
23 | #OR if running R on a Windows machine use: Ctrl+R
24 |
25 | #To learn about any command type: ?command_name OR help.search("command_name")
26 | #e.g. ?read.table
27 |
28 | #This tutorial assumes you are running R on your own laptop and therefore the graphics generated can be viewed directly
29 | #If you were running this tutorial in a Linux terminal without X you would write each graph to a file and then open that file
30 | #Every time you execute a 'plot()', 'hist()', 'boxplot()', etc. command, a new window will open to render the graph
31 | #Or if you leave this window open, each time you draw a new graph it will replace the old one
32 | #Similarly, we will often open a graph, and then in subsequent steps add annotation to that graph (legends, labels, etc.)
33 | #I recommend that you just leave the graphics window open and keep viewing it as you go.
34 |
35 | #Test your graphics by running the R graphics demo.
36 | #Press Enter when prompted, then view the graph and press Enter to continue to the next graphic
37 | #try this on your own laptop installations of R (it won't work interactively on the Amazon cloud without X11)
38 | #demo(graphics)
39 |
40 | #Clean up workspace - i.e. delete variables created by the graphics demo
41 | rm(list = ls(all = TRUE))
42 |
43 | #Working with variables in R
44 | x = 5
45 | y = 10
46 | z = x*y
47 |
48 | #Create a sequence of numbers and assign this sequence to a variable called 'my_sequence'
49 | my_sequence = 1:13
50 |
51 | #The contents of any variable can be printed to the screen by simply typing in the name of the variable and hitting Enter
52 | #If you do this for a variable that contains a large amount of data, it may take a while for everything to print out
53 | #If you do that by accident, you can abort a command by pressing Ctrl+C (or Esc in the R GUI)
54 | #View the contents of x, y, z, and 'my_sequence'
55 | x
56 | y
57 | z
58 | my_sequence
59 |
60 | #List the variables that exist in your current work space
61 | ls()
62 |
63 |
64 | #### Import the gene expression data from the Tophat/Cufflinks/Cuffdiff tutorial
65 |
66 | #Set working directory where results files exist
67 | working_dir = "~/workspace/rnaseq/de/tophat_cufflinks/ref_only"
68 | setwd(working_dir)
69 |
70 | #List the current contents of this directory - it should contain the Cuffdiff output files imported below
71 | dir()
72 |
73 | #Import expression and differential expression results from the Bowtie/Samtools/Tophat/Cufflinks/Cuffdiff pipeline
74 | file1="isoforms.read_group_tracking"
75 | file2="isoform_exp.diff"
76 | file3="isoforms.fpkm_tracking"
77 |
78 | #Read in tab delimited files and assign the resulting 'dataframe' to a variable
79 | #Use 'as.is' for columns that contain text/character values (i.e. non-numerical values)
80 | all_fpkm = read.table(file1, header=TRUE, sep="\t", as.is=c(1:2,9))
81 | tn_de = read.table(file2, header=TRUE, sep="\t", as.is=c(1:7,14))
82 | tn_fpkm = read.table(file3, header=TRUE, sep="\t", as.is=c(1:9,13,17))
83 |
84 |
85 | #### Working with 'dataframes'
86 | #View the first six rows of data (all columns) in one of the dataframes created
87 | head(tn_de)
88 |
89 | #View the column names
90 | names(tn_de)
91 |
92 | #Determine the dimensions of the dataframe. 'dim()' will return the number of rows and columns
93 | dim(tn_de)
94 |
95 | #Get the first 3 rows of data and a selection of columns
96 | tn_de[1:3,c(2:4,7,10,12)]
97 |
98 | #Select a set of columns by name instead of by number
99 | tn_de[1:3, c("gene_id","locus","value_1","value_2")]
100 |
101 | #Rename some of the columns from ugly names to more human readable names
102 | names(all_fpkm) = c("tracking_id", "condition", "replicate", "raw_frags", "internal_scaled_frags", "external_scaled_frags", "FPKM", "effective_length", "status")
103 | names(tn_de) = c("test_id", "gene_id", "gene_name", "locus", "sample_1", "sample_2", "status", "value_1", "value_2", "fold_change", "test_stat", "p_value", "q_value", "significant")
104 | names(tn_fpkm) = c("tracking_id", "class_code", "nearest_ref_id", "gene_id", "gene_name", "tss_id", "locus", "length", "coverage", "UHR_FPKM", "UHR_conf_lo", "UHR_conf_hi", "UHR_status", "HBR_FPKM", "HBR_conf_lo", "HBR_conf_hi", "HBR_status")
105 |
106 | #Get ID to gene name mapping
107 | gene_mapping=tn_fpkm[,"gene_name"]
108 | names(gene_mapping)=tn_fpkm[,"tracking_id"]
109 |
110 | #Reformat per-replicate FPKM data into a standard matrix
111 | UHR_1=all_fpkm[all_fpkm[,"condition"]=="UHR" & all_fpkm[,"replicate"]==0,"FPKM"]
112 | UHR_2=all_fpkm[all_fpkm[,"condition"]=="UHR" & all_fpkm[,"replicate"]==1,"FPKM"]
113 | UHR_3=all_fpkm[all_fpkm[,"condition"]=="UHR" & all_fpkm[,"replicate"]==2,"FPKM"]
114 | HBR_1=all_fpkm[all_fpkm[,"condition"]=="HBR" & all_fpkm[,"replicate"]==0,"FPKM"]
115 | HBR_2=all_fpkm[all_fpkm[,"condition"]=="HBR" & all_fpkm[,"replicate"]==1,"FPKM"]
116 | HBR_3=all_fpkm[all_fpkm[,"condition"]=="HBR" & all_fpkm[,"replicate"]==2,"FPKM"]
117 |
118 | #Add ids as row names and gene names as initial column along with all data
119 | ids=unique(all_fpkm[,"tracking_id"])
120 | gene_names=gene_mapping[ids]
121 | fpkm_matrix=data.frame(gene_names,UHR_1,UHR_2,UHR_3,HBR_1,HBR_2,HBR_3)
122 | row.names(fpkm_matrix)=ids
123 | data_columns=c(2:7)
124 | short_names=c("UHR_1","UHR_2","UHR_3","HBR_1","HBR_2","HBR_3")
125 |
126 | #Assign colors to each. You can specify color by RGB, Hex code, or name
127 | #To get a list of color names:
128 | colours()
129 | data_colors=c("tomato1","tomato2","tomato3","royalblue1","royalblue2","royalblue3")
130 |
131 | #View expression values for the transcripts of a particular gene symbol on chromosome 22. e.g. 'TST'
132 | #First determine the rows in the data.frame that match 'TST', then display only those rows of the data.frame
133 | i = which(fpkm_matrix[,"gene_names"] == "TST")
134 | fpkm_matrix[i,]
135 |
136 | #What if we want to view values for a list of genes of interest all at once?
137 | genes_of_interest = c("TST", "MMP11", "LGALS2", "ISX")
138 | i = which(fpkm_matrix[,"gene_names"] %in% genes_of_interest)
139 | fpkm_matrix[i,]
140 |
141 |
142 | #### Examine basic features of the differential expression file
143 | #In part 1 of the tutorial, cuffdiff attempted to perform a differential expression test for each row of data (i.e. each gene/transcript)
144 | #However, sometimes this test fails due to insufficient data, etc. These cases are summarized in the 'status' column
145 | #Summarize the status of all tests
146 | status_counts=table(tn_de[,"status"])
147 | status_counts
148 |
149 | #Plot #1 - Make a barplot of these status counts, first using the basic plotting functions of R, and then using the ggplot2 package
150 | barplot(status_counts, col=rainbow(6), xlab="Status", ylab="Transcript count", main="Status counts reported by Cuffdiff")
151 |
152 | #Plot #2 - Now the same idea using ggplot2
153 | Status=factor(tn_de[,"status"])
154 | qplot(Status, data=tn_de, geom="bar", fill=Status, xlab="Status", ylab="Transcript count", main="Status counts reported by Cuffdiff")
155 |
156 | #Plot #3 - Make a piechart of these status counts, first using the basic plotting functions of R, and then using the ggplot2 package
157 | pie(status_counts, col=rainbow(6), main="Status counts reported by Cuffdiff")
158 |
159 | #Plot #4 - Now the same idea using ggplot2
160 | #The counts are already tabulated, so geom_bar() needs stat="identity" to plot them as-is
161 | zz=as.data.frame(status_counts)
162 | names(zz) = c("Status", "Count")
163 | pp <- ggplot(zz, aes(x="", y=Count, fill=Status)) + geom_bar(width=1, stat="identity") + coord_polar("y")
164 | print(pp)
165 |
166 | #Plot #5 - Make a dotchart of these status counts
167 | dotchart(as.numeric(status_counts), col=rainbow(6), labels=names(status_counts), xlab="Transcript count", main="Status counts reported by Cuffdiff", pch=16)
168 |
169 | #Each row of data represents a transcript. Many of these transcripts represent the same gene. Determine the numbers of transcripts and unique genes
170 | length(tn_de[,"gene_name"]) #Transcript count
171 | length(unique(tn_de[,"gene_name"])) #Unique Gene count
172 |
173 |
174 | #### Plot #6 - the number of transcripts per gene.
175 | #Many genes will have only 1 transcript, some genes will have several transcripts
176 | #Use the 'table()' command to count the number of times each gene symbol occurs (i.e. the # of transcripts that have each gene symbol)
177 | #Then use the 'hist' command to create a histogram of these counts
178 | #How many genes have 1 transcript? More than one transcript? What is the maximum number of transcripts for a single gene?
179 | counts=table(tn_de[,"gene_name"])
180 | c_one = length(which(counts == 1))
181 | c_more_than_one = length(which(counts > 1))
182 | c_max = max(counts)
183 | hist(counts, breaks=50, col="bisque4", xlab="Transcripts per gene", main="Distribution of transcript count per gene")
184 | legend_text = c(paste("Genes with one transcript =", c_one), paste("Genes with more than one transcript =", c_more_than_one), paste("Max transcripts for single gene = ", c_max))
185 | legend("topright", legend_text, lty=NULL)
186 |
187 |
188 | #### Plot #7 - the distribution of transcript sizes as a histogram
189 | #In this analysis we supplied Cufflinks with transcript models so the lengths will be those of known transcripts
190 | #However, if we had used a de novo transcript discovery mode, this step would give us some idea of how well transcripts were being assembled
191 | #If we had a low coverage library, or other problems, we might get short 'transcripts' that are actually only pieces of real transcripts
192 | hist(tn_fpkm[,"length"], breaks=50, xlab="Transcript length (bp)", main="Distribution of transcript lengths", col="steelblue")
193 |
194 |
195 | #### Summarize FPKM values for all 6 replicates
196 | #What are the minimum and maximum FPKM values for a particular library?
197 | min(fpkm_matrix[,"UHR_1"])
198 | max(fpkm_matrix[,"UHR_1"])
199 |
200 | #Set the minimum non-zero FPKM values for use later.
201 | #Do this by grabbing a copy of all data values, converting 0's to NA, and calculating the minimum of all non-NA values
202 | #zz = fpkm_matrix[,data_columns]
203 | #zz[zz==0] = NA
204 | #min_nonzero = min(zz, na.rm=TRUE)
205 | #min_nonzero
206 |
207 | #Alternatively just set min value to 1
208 | min_nonzero=1
209 |
210 | #### Plot #8 - View the range of values and general distribution of FPKM values for all 6 libraries
211 | #Create boxplots for this purpose
212 | #Display on a log2 scale and add the minimum non-zero value to avoid log2(0)
213 | boxplot(log2(fpkm_matrix[,data_columns]+min_nonzero), col=data_colors, names=short_names, las=2, ylab="log2(FPKM)", main="Distribution of FPKMs for all 6 libraries")
214 | #Note that the bold horizontal line on each boxplot is the median
215 |
216 | #### Plot #9 - plot a pair of replicates to assess reproducibility of technical replicates
217 | #Transform the data by converting to log2 scale after adding an arbitrary small value to avoid log2(0)
218 | x = fpkm_matrix[,"UHR_1"]
219 | y = fpkm_matrix[,"UHR_2"]
220 | plot(x=log2(x+min_nonzero), y=log2(y+min_nonzero), pch=16, col="blue", cex=0.25, xlab="FPKM (UHR, Replicate 1)", ylab="FPKM (UHR, Replicate 2)", main="Comparison of expression values for a pair of replicates")
221 |
222 | #Add a straight line of slope 1, and intercept 0
223 | abline(a=0,b=1)
224 |
225 | #Calculate the squared Pearson correlation coefficient (R squared) and display it in a legend
226 | rs=cor(x,y)^2
227 | legend("topleft", paste("R squared = ", round(rs, digits=3), sep=""), lwd=1, col="black")
228 |
229 | #### Plot #10 - Scatter plots with a large number of data points can be misleading ... regenerate this figure as a density scatter plot
230 | colors = colorRampPalette(c("white", "blue", "#007FFF", "cyan","#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"))
231 | smoothScatter(x=log2(x+min_nonzero), y=log2(y+min_nonzero), xlab="FPKM (UHR, Replicate 1)", ylab="FPKM (UHR, Replicate 2)", main="Comparison of expression values for a pair of replicates", colramp=colors, nbin=200)
232 |
233 |
234 | #### Plot all sets of replicates on a single plot
235 | #Create a function that generates an R plot. This function will take as input the two libraries to be compared, a plot name, and a color
236 | plotCor = function(lib1, lib2, name, color){
237 | x=fpkm_matrix[,lib1]
238 | y=fpkm_matrix[,lib2]
239 | zero_count = length(which(x==0)) + length(which(y==0))
240 | plot(x=log2(x+min_nonzero), y=log2(y+min_nonzero), pch=16, col=color, cex=0.25, xlab=lib1, ylab=lib2, main=name)
241 | abline(a=0,b=1)
242 | rs=cor(x,y, method="pearson")^2
243 | legend_text = c(paste("R squared = ", round(rs, digits=3), sep=""), paste("Zero count = ", zero_count, sep=""))
244 | legend("topleft", legend_text, lwd=c(1,NA), col="black", bg="white", cex=0.8)
245 | }
246 | #Open a plotting page with room for two plots on one page
247 | par(mfrow=c(1,2))
248 |
249 | #Plot #11 - Now make a call to our custom function created above, once for each library comparison
250 | plotCor("UHR_1", "HBR_1", "UHR_1 vs HBR_1", "tomato2")
251 | plotCor("UHR_2", "HBR_2", "UHR_2 vs HBR_2", "royalblue2")
252 |
253 |
254 | ##### One problem with these plots is that there are so many data points on top of each other, that information is being lost
255 | #Regenerate these plots using a density scatter plot
256 | plotCor2 = function(lib1, lib2, name, color){
257 | x=fpkm_matrix[,lib1]
258 | y=fpkm_matrix[,lib2]
259 | zero_count = length(which(x==0)) + length(which(y==0))
260 | colors = colorRampPalette(c("white", "blue", "#007FFF", "cyan","#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"))
261 | smoothScatter(x=log2(x+min_nonzero), y=log2(y+min_nonzero), xlab=lib1, ylab=lib2, main=name, colramp=colors, nbin=275)
262 | abline(a=0,b=1)
263 | rs=cor(x,y, method="pearson")^2
264 | legend_text = c(paste("R squared = ", round(rs, digits=3), sep=""), paste("Zero count = ", zero_count, sep=""))
265 | legend("topleft", legend_text, lwd=c(1,NA), col="black", bg="white", cex=0.8)
266 | }
267 |
268 | #### Plot #12 - Now make a call to our custom function created above, once for each library comparison
269 | par(mfrow=c(1,2))
270 | plotCor2("UHR_1", "HBR_1", "UHR_1 vs HBR_1", "tomato2")
271 | plotCor2("UHR_2", "HBR_2", "UHR_2 vs HBR_2", "royalblue2")
272 |
273 |
274 | #### Compare the correlation 'distance' between all replicates
275 | #Do we see the expected pattern for all six libraries (i.e. replicates most similar, then UHR vs. HBR)?
276 |
277 | #Calculate the FPKM sum for all 6 libraries
278 | fpkm_matrix[,"sum"]=apply(fpkm_matrix[,data_columns], 1, sum)
279 |
280 | #Identify the genes with a grand sum FPKM greater than 5 - we will filter out the genes with very low expression across the board
281 | i = which(fpkm_matrix[,"sum"] > 5)
282 |
283 | #Calculate the correlation between all pairs of data
284 | r=cor(fpkm_matrix[i,data_columns], use="pairwise.complete.obs", method="pearson")
285 |
286 | #Print out these correlation values
287 | r
288 |
289 | #### Plot #13 - Convert correlation to 'distance', and use 'multi-dimensional scaling' to display the relative differences between libraries
290 | #This step calculates 2-dimensional coordinates to plot points for each library
291 | #Libraries with similar expression patterns (highly correlated to each other) should group together
292 | #What pattern do we expect to see, given the types of libraries we have (technical replicates, biological replicates, tumor/normal)?
293 | d=1-r
294 | mds=cmdscale(d, k=2, eig=TRUE)
295 | par(mfrow=c(1,1))
296 | plot(mds$points, type="n", xlab="", ylab="", main="MDS distance plot (all non-zero genes)", xlim=c(-0.12,0.12), ylim=c(-0.12,0.12))
297 | points(mds$points[,1], mds$points[,2], col="grey", cex=2, pch=16)
298 | text(mds$points[,1], mds$points[,2], short_names, col=data_colors)
299 |
300 | #### Plot #14 - View the distribution of differential expression values as a histogram
301 | #Display only those that are significant according to Cuffdiff
302 | sig = which(tn_de[,"p_value"]<0.05)
303 | de = log2(tn_de[sig,"value_1"]+min_nonzero) - log2(tn_de[sig,"value_2"]+min_nonzero)
304 | tn_de[,"de"] = log2(tn_de[,"value_1"]+min_nonzero) - log2(tn_de[,"value_2"]+min_nonzero)
305 | hist(de, breaks=50, col="seagreen", xlab="Log2 difference (UHR - HBR)", main="Distribution of differential expression values")
306 | abline(v=-2, col="black", lwd=2, lty=2)
307 | abline(v=2, col="black", lwd=2, lty=2)
308 | legend("topleft", "Fold-change > 4", lwd=2, lty=2)
309 |
310 |
311 | #### Plot #15 - Display the grand expression values from UHR and HBR and mark those that are significantly differentially expressed
312 | x=log2(tn_de[,"value_1"]+min_nonzero)
313 | y=log2(tn_de[,"value_2"]+min_nonzero)
314 | plot(x=x, y=y, pch=16, cex=0.25, xlab="UHR FPKM (log2)", ylab="HBR FPKM (log2)", main="UHR vs HBR FPKMs")
315 | abline(a=0, b=1)
316 | xsig=x[sig]
317 | ysig=y[sig]
318 | points(x=xsig, y=ysig, col="magenta", pch=16, cex=0.5)
319 | legend("topleft", "Significant", col="magenta", pch=16)
320 |
321 | #Get the gene symbols for the top N (according to corrected p-value) and display them on the plot
322 | #topn = order(abs(tn_de[,"fold_change"]), decreasing=TRUE)[1:25] #Alternative: rank by absolute fold change instead
323 | topn = order(tn_de[,"q_value"])[1:25]
324 | text(x[topn], y[topn], tn_de[topn,"gene_name"], col="black", cex=0.75, srt=45)
325 |
326 |
327 | #### Write a simple table of differentially expressed transcripts to an output file
328 | #Each should be significant (p-value < 0.05) with an absolute log2 fold-change >= 2
329 | sig = which(tn_de[,"p_value"]<0.05 & abs(tn_de[,"de"]) >= 2)
330 | sig_tn_de = tn_de[sig,]
331 |
332 | #Order the output by q-value and then break ties using the absolute log2 fold-change
333 | o = order(sig_tn_de[,"q_value"], -abs(sig_tn_de[,"de"]), decreasing=FALSE)
334 | output = sig_tn_de[o,c("gene_id","gene_name","locus","value_1","value_2","de","p_value")]
335 | write.table(output, file="SigDE_supplementary_R.txt", sep="\t", row.names=FALSE, quote=FALSE)
336 |
337 | #View selected columns of the first 25 lines of output
338 | output[1:25,c(2,4,5,6,7)]
339 |
340 | #You can open the file "SigDE_supplementary_R.txt" in Excel, Calc, etc.
341 | #It should have been written to the current working directory that you set at the beginning of the R tutorial
342 | dir()
343 |
344 |
345 | #### Plot #16 - Create a heatmap to visualize expression differences between the six samples
346 | #Define custom dist and hclust functions for use with heatmaps
347 | mydist=function(c) {dist(c,method="euclidean")}
348 | myclust=function(c) {hclust(c,method="average")}
349 |
350 | main_title="sig DE Transcripts"
351 | par(cex.main=0.8)
352 | sig_genes=tn_de[sig,"test_id"]
353 | sig_gene_names=gene_mapping[sig_genes]
354 | data=log2(as.matrix(fpkm_matrix[sig_genes,data_columns])+1)
355 | heatmap.2(data, hclustfun=myclust, distfun=mydist, na.rm = TRUE, scale="none", dendrogram="both", margins=c(6,7), Rowv=TRUE, Colv=TRUE, symbreaks=FALSE, key=TRUE, symkey=FALSE, density.info="none", trace="none", main=main_title, cexRow=0.3, cexCol=1, labRow=sig_gene_names,col=rev(heat.colors(75)))
356 |
357 | dev.off()
358 |
359 | #The output file can be viewed in your browser at the following url:
360 | #Note, you must replace __YOUR_IP_ADDRESS__ with the address of your own Amazon instance
361 | #http://__YOUR_IP_ADDRESS__/workspace/rnaseq/de/tophat_cufflinks/ref_only/Tutorial_Part3_Supplementary_R_output.pdf
362 | #To exit R type:
363 | quit(save="no")
364 |
365 |
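The three transformations this script leans on most (a pseudocount before log2, R squared between replicates, and correlation converted to a distance for MDS) can be tried in isolation. The following is a minimal sketch using base R only. The FPKM values and the correlation matrix are made up for illustration and are not taken from the tutorial data.

```r
# Toy FPKM values for two replicates (hypothetical values, for illustration only)
x <- c(0, 10, 200, 35)
y <- c(0, 12, 180, 40)

# A pseudocount of 1 maps zero FPKMs to log2(1) = 0 instead of -Inf
pseudocount <- 1
log2_x <- log2(x + pseudocount)
log2_y <- log2(y + pseudocount)

# Squared Pearson correlation on the untransformed values, as in plotCor()
rs <- cor(x, y, method = "pearson")^2

# Hypothetical correlation matrix for four libraries:
# replicates correlate strongly, different conditions correlate weakly
libs <- c("UHR_1", "UHR_2", "HBR_1", "HBR_2")
r <- matrix(0.2, nrow = 4, ncol = 4, dimnames = list(libs, libs))
r["UHR_1", "UHR_2"] <- r["UHR_2", "UHR_1"] <- 0.9
r["HBR_1", "HBR_2"] <- r["HBR_2", "HBR_1"] <- 0.8
diag(r) <- 1

# Convert correlation to a distance and project the libraries into 2 dimensions
d <- 1 - r
mds <- cmdscale(d, k = 2)

# Pairwise distances between libraries in the 2-D MDS space
mds_dist <- as.matrix(dist(mds))
print(round(mds_dist, 3))
```

In the printed matrix the replicate pairs should sit much closer to each other than to libraries from the other condition, which is the grouping the MDS plot is meant to reveal.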
--------------------------------------------------------------------------------
/scripts/Tutorial_Module4_Part3_Supplementary_R_output.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial_v1/c76cf34be08ef87a5de5aca7c6bcc25fc42611d6/scripts/Tutorial_Module4_Part3_Supplementary_R_output.pdf
--------------------------------------------------------------------------------
/scripts/Tutorial_Module4_Part4_edgeR.R:
--------------------------------------------------------------------------------
1 | #Tutorial_Module4_Part4_edgeR.R
2 |
3 | #Malachi Griffith, mgriffit[AT]genome.wustl.edu
4 | #Obi Griffith, ogriffit[AT]genome.wustl.edu
5 | #The Genome Institute, Washington University School of Medicine
6 | #R tutorial for CBW - Informatics for RNA-sequence Analysis
7 |
8 | #######################
9 | # Loading Data into R #
10 | #######################
11 |
12 | #Set working directory where output will go
13 | working_dir = "~/workspace/rnaseq/de/tophat_counts"
14 | setwd(working_dir)
15 |
16 | #Read in gene mapping
17 | mapping=read.table("~/workspace/rnaseq/refs/hg19/genes/ENSG_ID2Name.txt", header=FALSE, stringsAsFactors=FALSE, row.names=1)
18 |
19 | # Read in count matrix
20 | dat=read.table("~/workspace/rnaseq/expression/tophat_counts/gene_read_counts_table_all.tsv", header=TRUE, stringsAsFactors=FALSE, row.names=1)
21 |
22 | #The last 5 rows are summary data, remove
23 | rawdata=dat[1:(length(rownames(dat))-5),]
24 |
25 | # Set column names (optional, but this helps keep the conditions straight)
26 | colnames(rawdata) <- c("UHR_1","UHR_2","UHR_3","HBR_1","HBR_2","HBR_3")
27 |
28 | # Check dimensions
29 | dim(rawdata)
30 |
31 | # Keep genes whose 75th-percentile count across samples is at least 25 (i.e. roughly 25% of samples have count >= 25)
32 | quant <- apply(rawdata,1,quantile,0.75)
33 | keep <- which(quant >= 25)
34 | rawdata <- rawdata[keep,]
35 | dim(rawdata)
36 |
37 | #################
38 | # Running edgeR #
39 | #################
40 |
41 | # load edgeR
42 | library('edgeR')
43 |
44 | # make class labels
45 | class <- factor( c( rep("UHR",3), rep("HBR",3) ))
46 |
47 | # Get common gene names
48 | genes=rownames(rawdata)
49 | gene_names=mapping[genes,1]
50 |
51 |
52 | # Make DGEList object
53 | y <- DGEList(counts=rawdata, genes=genes, group=class)
54 | nrow(y)
55 |
56 | # TMM Normalization
57 | y <- calcNormFactors(y)
58 |
59 | # Estimate dispersion
60 | y <- estimateCommonDisp(y, verbose=TRUE)
61 | y <- estimateTagwiseDisp(y)
62 |
63 | # Differential expression test
64 | et <- exactTest(y)
65 |
66 | # Print top genes
67 | topTags(et)
68 |
69 | # Print number of up/down significant genes at FDR = 0.05 significance level
70 | summary(de <- decideTestsDGE(et, p.value=0.05))
71 | detags <- rownames(y)[as.logical(de)]
72 |
73 |
74 | # Output DE genes
75 | # Matrix of significantly DE genes
76 | mat <- cbind(
77 | genes,gene_names,
78 | sprintf('%0.3f',log10(et$table$PValue)),
79 | sprintf('%0.3f',et$table$logFC)
80 | )[as.logical(de),]
81 | colnames(mat) <- c("Gene", "Gene_Name", "Log10_Pvalue", "Log_fold_change")
82 |
83 | # Order by log fold change
84 | o <- order(et$table$logFC[as.logical(de)],decreasing=TRUE)
85 | mat <- mat[o,]
86 |
87 | # Save table
88 | write.table(mat, file="DE_genes.txt", quote=FALSE, row.names=FALSE, sep="\t")
89 |
90 | #To exit R type the following
91 | quit(save="no")
92 |
93 |
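The low-count filter near the top of this script is easier to follow on a small example. The sketch below applies the same 75th-percentile rule to a made-up count matrix using base R only. The gene names and counts are hypothetical.

```r
# Toy count matrix: 3 genes x 6 samples (hypothetical values, for illustration only)
toy_counts <- data.frame(
  UHR_1 = c(0, 30, 100), UHR_2 = c(1, 40, 120), UHR_3 = c(0, 26, 90),
  HBR_1 = c(2, 3, 110),  HBR_2 = c(0, 4, 95),   HBR_3 = c(1, 2, 105),
  row.names = c("geneA", "geneB", "geneC")
)

# 75th percentile of the counts for each gene (one value per row)
quant <- apply(toy_counts, 1, quantile, 0.75)

# Keep genes whose 75th percentile is at least 25
keep <- quant >= 25
filtered <- toy_counts[keep, ]

print(quant)
print(rownames(filtered))
```

Here geneA is dropped because it is barely detected in any sample, while geneB survives even though it is expressed in only one condition. Keeping genes like geneB is the reason for using a percentile threshold rather than a mean or a minimum.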
--------------------------------------------------------------------------------
/setup/.bashrc:
--------------------------------------------------------------------------------
1 | # ~/.bashrc: executed by bash(1) for non-login shells.
2 | # see /usr/share/doc/bash/examples/startup-files (in the package bash-doc)
3 | # for examples
4 |
5 | export PATH=/home/ubuntu/bin/samtools-1.2:/home/ubuntu/bin/bam-readcount/bin:/home/ubuntu/bin/bowtie2-2.2.6:/home/ubuntu/bin/tophat-2.1.0.Linux_x86_64:/home/ubuntu/bin/STAR-STAR_2.5.0a/source:/home/ubuntu/bin/cufflinks-2.2.1.Linux_x86_64:/home/ubuntu/bin/HTSeq-0.6.1p1/scripts:/home/ubuntu/bin/FastQC:/home/ubuntu/bin/picard-tools-1.140:/home/ubuntu/bin/samstat-1.5.1/src:/home/ubuntu/bin/bedtools2/bin:/home/ubuntu/bin/flexbar_v2.4_linux64:/home/ubuntu/bin/R-3.2.2/bin:/home/ubuntu/bin/allpathslg-52488/bin:/home/ubuntu/bin/MUMmer3.23:/home/ubuntu/workspace/data/bin:/home/ubuntu/workspace/tools/anaconda/bin/:/home/ubuntu/bin/tabix-0.2.6:/home/ubuntu/bin/gkno_launcher:/home/ubuntu/workspace/data/bin:/home/ubuntu/bin/edirect:/home/ubuntu/bin/sratoolkit.2.5.4-1-ubuntu64/bin:$PATH
6 | export RNA_HOME=~/workspace/rnaseq
7 | export LD_LIBRARY_PATH=/home/ubuntu/bin/flexbar_v2.4_linux64:$LD_LIBRARY_PATH
8 | export MANPAGER=less
9 |
10 |
11 | # If not running interactively, don't do anything
12 | case $- in
13 | *i*) ;;
14 | *) return;;
15 | esac
16 |
17 | # don't put duplicate lines or lines starting with space in the history.
18 | # See bash(1) for more options
19 | HISTCONTROL=ignoreboth
20 |
21 | # append to the history file, don't overwrite it
22 | shopt -s histappend
23 |
24 | # for setting history length see HISTSIZE and HISTFILESIZE in bash(1)
25 | HISTSIZE=1000
26 | HISTFILESIZE=2000
27 |
28 | # check the window size after each command and, if necessary,
29 | # update the values of LINES and COLUMNS.
30 | shopt -s checkwinsize
31 |
32 | # If set, the pattern "**" used in a pathname expansion context will
33 | # match all files and zero or more directories and subdirectories.
34 | #shopt -s globstar
35 |
36 | # make less more friendly for non-text input files, see lesspipe(1)
37 | [ -x /usr/bin/lesspipe ] && eval "$(SHELL=/bin/sh lesspipe)"
38 |
39 | # set variable identifying the chroot you work in (used in the prompt below)
40 | if [ -z "${debian_chroot:-}" ] && [ -r /etc/debian_chroot ]; then
41 | debian_chroot=$(cat /etc/debian_chroot)
42 | fi
43 |
44 | # set a fancy prompt (non-color, unless we know we "want" color)
45 | case "$TERM" in
46 | xterm-color) color_prompt=yes;;
47 | esac
48 |
49 | # uncomment for a colored prompt, if the terminal has the capability; turned
50 | # off by default to not distract the user: the focus in a terminal window
51 | # should be on the output of commands, not on the prompt
52 | #force_color_prompt=yes
53 |
54 | if [ -n "$force_color_prompt" ]; then
55 | if [ -x /usr/bin/tput ] && tput setaf 1 >&/dev/null; then
56 | # We have color support; assume it's compliant with Ecma-48
57 | # (ISO/IEC-6429). (Lack of such support is extremely rare, and such
58 | # a case would tend to support setf rather than setaf.)
59 | color_prompt=yes
60 | else
61 | color_prompt=
62 | fi
63 | fi
64 |
65 | if [ "$color_prompt" = yes ]; then
66 | PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
67 | else
68 | PS1='${debian_chroot:+($debian_chroot)}\u@\h:\w\$ '
69 | fi
70 | unset color_prompt force_color_prompt
71 |
72 | # If this is an xterm set the title to user@host:dir
73 | case "$TERM" in
74 | xterm*|rxvt*)
75 | PS1="\[\e]0;${debian_chroot:+($debian_chroot)}\u@\h: \w\a\]$PS1"
76 | ;;
77 | *)
78 | ;;
79 | esac
80 |
81 | # enable color support of ls and also add handy aliases
82 | if [ -x /usr/bin/dircolors ]; then
83 | test -r ~/.dircolors && eval "$(dircolors -b ~/.dircolors)" || eval "$(dircolors -b)"
84 | alias ls='ls --color=auto'
85 | #alias dir='dir --color=auto'
86 | #alias vdir='vdir --color=auto'
87 |
88 | alias grep='grep --color=auto'
89 | alias fgrep='fgrep --color=auto'
90 | alias egrep='egrep --color=auto'
91 | fi
92 |
93 | # some more ls aliases
94 | alias ll='ls -alF'
95 | alias la='ls -A'
96 | alias l='ls -CF'
97 |
98 | # Add an "alert" alias for long running commands. Use like so:
99 | # sleep 10; alert
100 | alias alert='notify-send --urgency=low -i "$([ $? = 0 ] && echo terminal || echo error)" "$(history|tail -n1|sed -e '\''s/^\s*[0-9]\+\s*//;s/[;&|]\s*alert$//'\'')"'
101 |
102 | # Alias definitions.
103 | # You may want to put all your additions into a separate file like
104 | # ~/.bash_aliases, instead of adding them here directly.
105 | # See /usr/share/doc/bash-doc/examples in the bash-doc package.
106 |
107 | if [ -f ~/.bash_aliases ]; then
108 | . ~/.bash_aliases
109 | fi
110 |
111 | # enable programmable completion features (you don't need to enable
112 | # this, if it's already enabled in /etc/bash.bashrc and /etc/profile
113 | # sources /etc/bash.bashrc).
114 | if ! shopt -oq posix; then
115 | if [ -f /usr/share/bash-completion/bash_completion ]; then
116 | . /usr/share/bash-completion/bash_completion
117 | elif [ -f /etc/bash_completion ]; then
118 | . /etc/bash_completion
119 | fi
120 | fi
121 |
--------------------------------------------------------------------------------
/setup/preinstall.sh:
--------------------------------------------------------------------------------
1 | #! /bin/bash
2 | #Preinstall priming script for AWS installs.
3 |
4 | #This script assumes you are logged into an Amazon AWS instance with at least one ephemeral volume present (in this case '/dev/xvdb')
5 | #For example, this will work with instance types: m3.xlarge
6 | #You may need to customize this code for your specific instance type
7 |
8 | #Unmount the current /mnt mount point that is attached to /dev/xvdb by default
9 | sudo umount /mnt
10 |
11 | #Mount ephemeral storage
12 | sudo mkfs /dev/xvdb
13 | sudo mount /dev/xvdb /workspace
14 |
15 | #Make ephemeral storage mounts persistent
16 | echo -e "LABEL=cloudimg-rootfs / ext4 defaults 0 0\n/dev/xvdb /workspace auto defaults,nobootwait 0 2" | sudo tee /etc/fstab
17 |
18 | #change permissions on required drives
19 | sudo chown -R ubuntu:ubuntu /workspace
20 |
21 |
--------------------------------------------------------------------------------
/setup/setup_mounts.sh:
--------------------------------------------------------------------------------
1 | #! /bin/bash
2 | #Mount setup script for AWS instances that start from an existing snapshot.
3 |
4 | #This script assumes you are logged into an Amazon AWS instance with at least one ephemeral volume present (in this case '/dev/xvdb')
5 | #For example, this will work with instance types: m3.xlarge
6 | #You may need to customize this code for your specific instance type
7 |
8 | #Unmount the current /mnt mount point that is attached to /dev/xvdb by default
9 | #sudo umount /mnt
10 |
11 | #Mount ephemeral storage
12 | #sudo mkfs /dev/xvdb #Don't do this if starting with an existing snapshot
13 | sudo mount /dev/xvdb /workspace
14 |
15 | #Make ephemeral storage mounts persistent
16 | echo -e "LABEL=cloudimg-rootfs / ext4 defaults 0 0\n/dev/xvdb /workspace auto defaults,nobootwait 0 2" | sudo tee /etc/fstab
17 |
18 | #change permissions on required drives
19 | sudo chown -R ubuntu:ubuntu /workspace
20 |
21 |
--------------------------------------------------------------------------------