├── .gitignore
├── DESCRIPTION
├── NAMESPACE
├── README.md
├── inst
│   ├── doc
│   │   └── GRanges_and_GRangesList_slides.pdf
│   ├── extdata
│   │   ├── E-MTAB-1147-toptable.csv
│   │   ├── csaw-data-filtered.Rds
│   │   └── csaw-normfacs.Rds
│   └── script
│       └── ChIPSeq
│           └── NFYA
│               ├── csaw.R
│               ├── preprocess_NFYA.R
│               └── setup.R
└── vignettes
    ├── A_Introduction.Rmd
    ├── B_GenomicRanges.Rmd
    ├── C_DifferentialExpression.Rmd
    ├── D_MachineLearning.Rmd
    ├── E_GeneSetEnrichment.Rmd
    ├── F_ChIPSeq.Rmd
    ├── GSEA
    │   ├── 2012-07-06-Gentleman-GSEA.pdf
    │   ├── Category-ribosome.png
    │   ├── GSEA2011.pdf
    │   ├── GSEAlm.pdf
    │   ├── L8_Gene_Set_Testing.pdf
    │   ├── SPIA.pdf
    │   ├── SPIA.png
    │   ├── goseq.pdf
    │   ├── subramanian-F1-part.jpg
    │   ├── subramanian-F1.jpg
    │   ├── young-et-al-gb-2010-11-2-r14-2-cropped.pdf
    │   ├── young-et-al-gb-2010-11-2-r14-2-cropped.png
    │   ├── young-et-al-gb-2010-11-2-r14-2-cropped.tiff
    │   └── young-et-al-gb-2010-11-2-r14-2.pdf
    ├── I_LargeData.Rmd
    └── our_figures
        ├── ChIPSeq-workflow.png
        ├── ChIPSeq_nbt-1508-F1.jpg
        ├── FilesToPackages.png
        ├── GRanges.png
        ├── GRangesImplementation.png
        ├── GRangesList.png
        ├── GRangesListImplementation.png
        ├── RangeOperations.png
        ├── SequencingEcosystem.png
        ├── SummarizedExperiment.png
        ├── batch_effects_nrg2825-f2.jpg
        ├── copy_number_QC_2.png
        ├── journal.pcbi.1003118.t001.png
        ├── nmeth.3252-F2.jpg
        └── nrg2825-f2.jpg
/.gitignore:
--------------------------------------------------------------------------------
1 | vignettes/*R
2 | vignettes/*html
3 | vignettes/*\.md
4 | vignettes/AMakefile
--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
1 | Package: UseBioconductor
2 | Type: Package
3 | Title: Use R / Bioconductor for Sequence Analysis
4 | Version: 0.1.0
5 | Authors@R: c(person("Martin", "Morgan", role=c("aut", "cre"),
6 | email="mtmorgan@fhcrc.org"),
7 | person("Herve", "Pages", role="aut"))
8 | License: Artistic-2.0
9 | VignetteBuilder: knitr
10 | Description: This course is directed at intermediate users wanting to
11 | make effective use of R / Bioconductor for the analysis and
12 | comprehension of high-throughput sequence data. The README.md
13 | file in the root directory of the package GitHub repository
14 | outlines detailed content. The course
15 | combines lectures with extensive hands-on practicals; participants
16 | are required to bring a laptop with wireless internet access and a
17 | modern version of the Chrome or Safari web browser.
18 | Imports: devtools
19 | Suggests: ALL, AnnotationHub, BSgenome.Hsapiens.UCSC.hg19,
20 | BiocInstaller, BiocParallel, BiocStyle, Biostrings, ChIPseeker,
21 | DESeq2, GO.db, GenomicAlignments, GenomicRanges, GenomicFeatures,
22 | GenomicFiles, Gviz, MLSeq, PoiClaClu, RColorBrewer,
23 | RNAseqData.HNRNPC.bam.chr14, Rsamtools, ShortRead,
24 | TxDb.Hsapiens.UCSC.hg19.knownGene, VariantAnnotation, airway,
25 | class, cn.mops, csaw, dendextend, e1071, edgeR, fission,
26 | genefilter, ggplot2, goseq, gplots, httr, kernlab, knitr, limma,
27 | microbenchmark, org.Hs.eg.db, rtracklayer, shiny, sva, xtable
28 |
--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | exportPattern("^[^\\.]")
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Use R / Bioconductor for Sequence Analysis
2 | ==========================================
3 |
4 | Fred Hutchinson Cancer Research Center, Seattle, WA
5 | 6-7 April, 2015
6 |
7 | Contact: Martin Morgan
8 | ([mtmorgan@fredhutch.org](mailto:mtmorgan@fredhutch.org))
9 |
10 | This **INTERMEDIATE** course is designed for individuals comfortable
11 | using _R_, and with some familiarity with _Bioconductor_. It consists
12 | of approximately equal parts lecture and practical sessions addressing
13 | use of _Bioconductor_ software for analysis and comprehension of
14 | high-throughput sequence and related data. Specific topics include use
15 | of central _Bioconductor_ classes (e.g., _GRanges_,
16 | _SummarizedExperiment_), RNA-Seq gene differential expression, ChIP-seq
17 | and methylation work flows, approaches to management and integrative
18 | analysis of diverse high-throughput data types, and strategies for
19 | working with large data. Participants are required to bring a laptop
20 | with wireless internet access and a modern version of the Chrome or
21 | Safari web browser.
22 |
23 | Registration
24 | ------------
25 |
26 | Please register [online](https://register.bioconductor.org/Seattle-Apr-2015/).
27 |
28 | Schedule (tentative)
29 | --------------------
30 |
31 | Day 1 (9:00 - 12:30; 1:30 - 5:00)
32 |
33 | - [A. Introduction](vignettes/A_Introduction.Rmd). _Bioconductor_ and
34 | sequencing work flows
35 | - [B. Genomic Ranges](vignettes/B_GenomicRanges.Rmd). Working with Genomic
36 | Ranges and other _Bioconductor_ data structures (e.g., in the
37 | [GenomicRanges](http://bioconductor.org/packages/devel/bioc/html/GenomicRanges.html)
38 | package).
39 | - [C. Differential Gene Expression](vignettes/C_DifferentialExpression.Rmd). RNA-Seq
40 | known gene differential expression with
41 | [DESeq2](http://bioconductor.org/packages/devel/bioc/html/DESeq2.html)
42 | and
43 | [edgeR](http://bioconductor.org/packages/devel/bioc/html/edgeR.html).
44 |
45 | Day 2 (9:00 - 12:30; 1:30 - 5:00)
46 |
47 | - [D. Machine Learning](vignettes/D_MachineLearning.Rmd).
48 | - [E. Gene Set Enrichment](vignettes/E_GeneSetEnrichment.Rmd).
49 | - [F. ChIP-seq](vignettes/F_ChIPSeq.Rmd). ChIP-seq with
50 | [csaw](http://bioconductor.org/packages/devel/bioc/html/csaw.html).
51 | - [I. Large Data](vignettes/I_LargeData.Rmd). Efficient, parallel, and cloud
52 | programming with
53 | [BiocParallel](http://bioconductor.org/packages/devel/bioc/html/BiocParallel.html),
54 | [GenomicFiles](http://bioconductor.org/packages/devel/bioc/html/GenomicFiles.html),
55 | and other resources.
56 |
57 |
--------------------------------------------------------------------------------
/inst/doc/GRanges_and_GRangesList_slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/inst/doc/GRanges_and_GRangesList_slides.pdf
--------------------------------------------------------------------------------
/inst/extdata/E-MTAB-1147-toptable.csv:
--------------------------------------------------------------------------------
1 | "","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj"
2 | "3183",1432.40015979451,-4.52349712874567,0.142649929348813,-31.7104757737709,1.1142183172124e-220,4.99169806111155e-218
3 | "91828",58.2842234383409,-3.47820360534797,0.375671633985405,-9.25862719111525,2.07077412382055e-20,8.43369824974187e-19
4 | "81537",1106.56971762968,-2.75754768126593,0.130103490989373,-21.1950322031802,1.06126034080731e-99,2.37722316340838e-97
5 | "4776",97.461818194366,-2.24586186354094,0.255656305034135,-8.78469186684474,1.56793739271017e-18,4.94302089387694e-17
6 | "283624",29.5352007313394,-2.24001491381734,0.405813908150946,-5.51980814069129,3.39369993363115e-08,4.34393591504787e-07
7 | "4053",135.522149417936,-2.11428205785264,0.24084465907166,-8.77861301139987,1.6550293171463e-18,4.94302089387694e-17
8 | "85446",82.5670990089107,-2.00352501351428,0.263740884783949,-7.59656590655457,3.04092459703197e-14,7.17018010247538e-13
9 | "10484",1400.24185534692,-1.85837786734961,0.121197313100289,-15.3334906510001,4.56766676123494e-53,6.82104903011085e-51
10 | "55701",344.20811516999,-1.73961736041058,0.163991210221822,-10.6079914774548,2.73566503516169e-26,1.75082562250348e-24
11 | "1112",413.054950570526,1.71225867245646,0.144915703196688,11.8155495552644,3.2442057374712e-32,2.9068083407742e-30
12 | "26020",5310.95322880689,-1.70905823813073,0.143517145685143,-11.9083906662983,1.07022910135238e-32,1.19865659351466e-30
13 | "6038",32.4914429792246,-1.69968784664343,0.366391606851103,-4.63899231003443,3.50112079027611e-06,2.58385770343617e-05
14 | "57156",5.40571279381349,-1.67288939044111,0.661374876801022,-2.52941175892959,0.0114253899724891,0.0271297790944442
15 | "731223",1.9671026203033,-1.6631520358834,0.729790796053574,-2.27894356146595,0.0226704204780558,NA
16 | "5228",24.9520902859294,-1.66212943841284,0.418983607463788,-3.9670512373363,7.27673532212441e-05,0.00037906714236183
17 | "100289511",87.8236602256995,-1.56877555472898,0.279314024795923,-5.61652983904117,1.94830617205936e-08,2.64497322752301e-07
18 | "117153",2.51057825696152,-1.56229986614111,0.728707809352641,-2.14393182849105,0.0320383494699092,0.0639014161814284
19 | "64207",837.459611647065,-1.55593023990358,0.140965473336716,-11.0376690339416,2.51472224057808e-28,1.87765927296496e-26
20 | "51062",20.0323159397424,-1.55388086562808,0.435311134374935,-3.56958676891944,0.000357544779217963,0.00146954184485915
21 | "145553",4.81655530302247,-1.54751717547855,0.671233003068312,-2.30548433763627,0.0211394629389538,0.045751108196383
22 | "10243",338.746609584515,-1.54194126450271,0.156307175438636,-9.86481433226364,5.9143766881063e-23,3.31205094533953e-21
23 | "57523",53.2587591296687,-1.50299458765496,0.299482795871286,-5.01863415319833,5.2040146572385e-07,4.69605248272374e-06
24 | "83982",10.8825096069583,-1.48573332331613,0.540152626758592,-2.75058057614545,0.0059489758844623,0.0164514888656735
25 | "623",4.44636529012901,-1.45389193760792,0.673415172028592,-2.1589830434444,0.0308514824660318,0.0619796598420729
26 | "23428",465.75556990637,-1.44283570918619,0.152805630779627,-9.44229412113101,3.64705388049678e-21,1.63388013846256e-19
27 | "80757",3.97991438538255,-1.40720034199387,0.691136219604602,-2.03606800233813,0.0417435272586461,0.0799192316746729
28 | "55320",238.966477802658,1.38896184417887,0.202430797920733,6.86141564646082,6.81814611029793e-12,1.32805628583194e-10
29 | "79686",22.1405377031261,1.38080312985046,0.387682207339192,3.56168816548851,0.00036847788118727,0.00150070991610816
30 | "2353",180.28473241364,-1.37714962702149,0.219469882000525,-6.2748911808236,3.49878909654846e-10,5.80539820464336e-09
31 | "100288846",6.13470711409441,-1.3658476601296,0.624759870613376,-2.18619620813456,0.0288012535736109,0.0586498254589895
32 | "400221",0.974675560282545,-1.34592932682023,0.694561692801581,-1.9378110551869,0.0526462785117864,NA
33 | "55812",15.7757619629031,-1.32903981572061,0.467486530113565,-2.84294782867381,0.00446983865153918,0.0128364597172407
34 | "3306",946.728724035196,-1.30836085245296,0.153920421934333,-8.50024211219445,1.8919567401043e-17,5.29747887229204e-16
35 | "5583",10.2596742151672,-1.3059745576008,0.541578416948674,-2.41142282766516,0.0158904143243273,0.0360168753156853
36 | "122769",177.230285283856,1.30022612451547,0.171848110281372,7.56613571360531,3.84490364494892e-14,8.61258416468559e-13
37 | "9369",327.512279792085,-1.2986791099135,0.153417364982469,-8.46500726995215,2.56136714674227e-17,6.74995577494434e-16
38 | "25938",2940.80984818683,-1.24943045176459,0.128083761393273,-9.75479200621142,1.75961002141004e-22,8.75894766212997e-21
39 | "7080",39.9797438561718,-1.23377091972103,0.321577710150241,-3.83661827539172,0.000124740133499521,0.000576119379461705
40 | "9472",71.9883051049858,-1.22309036440844,0.251412913655463,-4.86486691007675,1.14533850830321e-06,9.32930275854249e-06
41 | "90668",14.4440661626902,-1.22071567267618,0.483546072224984,-2.52450747259552,0.0115860541702236,0.0273186961487376
42 | "145447",2.09685833455682,1.19563219242626,0.73157338941382,1.63432980166797,0.102189619324906,NA
43 | "5687",62.354573094425,-1.1872684016993,0.287257676434532,-4.13311287773328,3.57882850787495e-05,0.000208222749549088
44 | "283547",2.40093478925935,-1.18083334540971,0.726772496244827,-1.6247633908974,0.104212984371431,NA
45 | "7253",0.879171747422991,-1.17519089359932,0.675137230259378,-1.74066966081522,0.0817414992702238,NA
46 | "122416",415.961946881745,-1.17435665390991,0.355753515247696,-3.30104019658739,0.000963270808372096,0.0034249628742119
47 | "84334",54.2512832894478,1.1433387891006,0.260016835963868,4.39717214795845,1.09670361316364e-05,7.22534145143106e-05
48 | "97",90.1519744789873,1.12694944145441,0.217735052457569,5.17578326840149,2.26956797454968e-07,2.25603810221508e-06
49 | "196872",15.7593489679742,-1.12367315956595,0.444350633099326,-2.52879837647215,0.0114453755554687,0.0271297790944442
50 | "122616",113.349294456021,-1.11472480377903,0.22260554875543,-5.00762361949812,5.5106158935075e-07,4.84069788292424e-06
51 | "145482",46.4020593503469,1.1088341336381,0.278308701288859,3.98418780477595,6.77113089376902e-05,0.000361126981001014
52 | "89932",681.292332079321,-1.09389329764706,0.142895654664534,-7.65518937727752,1.93027575976792e-14,4.80424189097793e-13
53 | "79446",200.297567743048,-1.08659232599486,0.171699779013473,-6.32844335757474,2.47646729534503e-10,4.43782939325829e-09
54 | "145483",26.2372438697388,-1.0819372654162,0.394151208445639,-2.74498020615716,0.00605145475255693,0.0166322191972117
55 | "113146",8091.8204447092,-1.06040850099806,0.117993155913062,-8.9870339749144,2.53992126641034e-19,8.75295944116794e-18
56 | "3320",9589.96294823561,1.05902675815388,0.117077327549075,9.04553238721627,1.489376137666e-19,5.56033758061975e-18
57 | "2287",149.792366848785,-1.05205839830652,0.232812801346378,-4.51890270733556,6.21609492662488e-06,4.42033417004436e-05
58 | "440193",553.897351675936,1.04518810852461,0.149294487265307,7.00084864263766,2.54416682354357e-12,5.18084880430691e-11
59 | "256369",0.701899246796241,-1.02704072659222,0.654302908901173,-1.56967164996563,0.116491519980812,NA
60 | "1033",151.151456683266,1.02235205325258,0.194235517821479,5.26346604740035,1.41364624015627e-07,1.50788932283336e-06
61 | "7011",393.149102247071,-1.01811559110806,0.151810213548844,-6.70650259496861,1.99344304318901e-11,3.72109368061948e-10
62 | "376267",20.9890059390901,1.00660938767154,0.391700951193329,2.56984157073114,0.0101745033734847,0.0248959000389302
63 | "55333",236.474895409753,-0.990285186257954,0.184922192044451,-5.35514518462939,8.54877463379526e-08,9.34110008765921e-07
64 | "9495",11.2550212796222,0.968066197098626,0.512272398980966,1.88974888950555,0.0587915521164752,0.104105199004668
65 | "122786",78.9553417891622,-0.954789277524657,0.24946765726316,-3.82730686614603,0.000129552963347783,0.000592242118161295
66 | "387990",6.16710103651015,0.952202002872035,0.606639241618118,1.56963469809863,0.116500121331762,0.189101646219671
67 | "5083",108.475141844886,0.91754560883159,0.220373762126575,4.16358826013319,3.13284844360007e-05,0.000187135480364377
68 | "57452",0.593146932225047,-0.910528037565311,0.633755269119884,-1.43671868611016,0.150797943220308,NA
69 | "122525",3.52735441658135,-0.887435057422969,0.697239620683527,-1.2727834608036,0.203094891709724,0.303288371619855
70 | "55102",412.607879129318,-0.880915780433987,0.140260751881845,-6.28055795092319,3.37360099543862e-10,5.80539820464336e-09
71 | "51016",278.963431261046,-0.876459445298556,0.171960749160328,-5.09685756533537,3.45337959606685e-07,3.29173204050628e-06
72 | "283596",77.4942204410022,0.875541553881386,0.232175763794369,3.77102906682726,0.000162575732903335,0.000714058120987196
73 | "60485",235.494633614731,0.864454530417008,0.167142424890047,5.17196355734147,2.3164676942387e-07,2.25603810221508e-06
74 | "9787",449.732534427427,0.852685479266412,0.168171590857491,5.07033010105124,3.97126295542903e-07,3.70651209173376e-06
75 | "9870",1192.41922715277,-0.849621050535841,0.118733863130534,-7.15567596416687,8.3261609148454e-13,1.77624766183369e-11
76 | "55727",396.983610074766,0.847078271118886,0.148688222217879,5.69700988069942,1.21926864055903e-08,1.70697609678264e-07
77 | "643866",29.8986180482568,-0.841230916054162,0.339840518976871,-2.47536967806778,0.0133098327193585,0.0305784874783211
78 | "10965",407.735439529025,-0.838800199151072,0.156233440131432,-5.36889028651885,7.9222592850066e-08,8.8729303992074e-07
79 | "22863",204.054348536117,0.836528646902067,0.180364617893545,4.63798641148013,3.51819910512514e-06,2.58385770343617e-05
80 | "55745",166.358100525616,-0.827760294086051,0.188966118115341,-4.3804693790704,1.18423934203653e-05,7.68897427872991e-05
81 | "79609",45.6179984660877,0.82409110067801,0.287266822033321,2.8687305232291,0.00412122756848354,0.0120673852985662
82 | "1397",1966.60401305455,-0.822542701404016,0.168169908806144,-4.89114079470778,1.00253217134125e-06,8.47423420303551e-06
83 | "4253",154.936826325271,-0.815084188732455,0.212228713883587,-3.84059335712485,0.000122737283887363,0.000572773991474363
84 | "161394",5.22258593152774,-0.806396187420136,0.637288592668533,-1.2653548120852,0.20574416610269,0.305209888788096
85 | "2342",24.5550647317514,0.806091130810016,0.390961291856742,2.06181826078421,0.0392250418626059,0.0757449084243425
86 | "55644",385.073371542473,0.800991096574132,0.147672584132224,5.42410157769659,5.82467535549846e-08,6.86698568227187e-07
87 | "283554",59.938691282558,-0.796349135190337,0.267164280867933,-2.98074702427753,0.00287546221876165,0.00876331342860692
88 | "11099",468.355626236144,0.79598401810395,0.132593638853168,6.00318405157739,1.9348507539127e-09,2.96643849785836e-08
89 | "55775",317.34226625771,0.795795937459591,0.144087883201904,5.52298999593516,3.33278718861787e-08,4.34393591504787e-07
90 | "10516",313.360473673507,-0.788295129976791,0.157116393502735,-5.01726848741007,5.24113000303989e-07,4.69605248272374e-06
91 | "64423",3154.45957727645,-0.783851075312696,0.165179220758605,-4.74545812550008,2.08034949164283e-06,1.63508170571226e-05
92 | "123036",12.0736511733874,0.781246766183047,0.498774220877891,1.56633349014706,0.117270564564647,0.189665028609971
93 | "161436",32.6827330441306,0.780582192381665,0.334955368955764,2.33040656973244,0.0197846728755009,0.0432367485279239
94 | "122706",0.477524391504373,-0.777838058677479,0.605830758098615,-1.28391972226485,0.199170045725382,NA
95 | "1053",0.500433314666908,-0.777667505227457,0.605897412145765,-1.28349699080802,0.199318013158563,NA
96 | "8812",589.050177970424,0.767404264778082,0.128940889603387,5.95159741132982,2.65537899966118e-09,3.83745094144584e-08
97 | "57820",461.320081323437,0.761338992960825,0.138424726432583,5.50002165495788,3.79744608355319e-08,4.72571068175508e-07
98 | "57570",181.225063317447,0.759943670172124,0.193798351453344,3.92131132423527,8.80683916950211e-05,0.000448348175901926
99 | "51562",80.3712155569268,0.744997291414789,0.265466738223546,2.80636774460022,0.00501034754741829,0.0142065550711607
100 | "645431",1.10140075018282,-0.738651270256046,0.721669652787289,-1.02353101228958,0.30605684462904,NA
101 | "1241",184.282018219079,0.737599497791902,0.175269626758216,4.20837033452133,2.5721906119404e-05,0.00016230160480976
102 | "1965",1002.8175409666,0.725176694297716,0.132863688005995,5.45805031593732,4.81391256493193e-08,5.82873737591758e-07
103 | "8814",65.9585869441561,-0.724716803593119,0.272062047371212,-2.66379236132222,0.00772652383936536,0.0201248521967621
104 | "10572",1533.9592112381,-0.72335246379959,0.172798275562376,-4.1861092736338,2.8377677144071e-05,0.00017415341589786
105 | "161357",3.91093614010412,0.720132613320407,0.689070596739257,1.04507813383438,0.295986859209724,0.407549440212015
106 | "23351",1551.62887257246,-0.708929251855538,0.118176328279823,-5.99891079858991,1.98645435124443e-09,2.96643849785836e-08
107 | "123099",6.91419698610375,0.706087020640966,0.62433958513667,1.13093425028689,0.258082766187041,0.362448524300296
108 | "51804",378.731637143684,0.701938692448754,0.14386571196158,4.87912430889863,1.0655789055921e-06,8.84035832787521e-06
109 | "55038",540.199366046281,0.697281116634237,0.166471422414359,4.18859349263357,2.80688672128689e-05,0.00017415341589786
110 | "7443",115.006428831965,0.69542916816118,0.212805569550322,3.26790868129385,0.00108345317494672,0.0037537156353444
111 | "60686",216.029900967751,0.695151900366022,0.168837524944053,4.11728317266184,3.83364921970008e-05,0.000216986144434838
112 | "100529261",15.2191803823681,0.68839581831613,0.435485375543253,1.58075530655278,0.113933997210282,0.18669728198366
113 | "90673",48.104431834993,-0.688274871515456,0.280867245112756,-2.4505344909092,0.0142644295788307,0.0326044104658987
114 | "81892",395.803074661241,0.687404960674608,0.139450638959126,4.92937835068718,8.24916818652695e-07,7.10697566839244e-06
115 | "55237",0.244604444377625,0.683497037359114,0.493198534099321,1.38584563842493,0.165794043181846,NA
116 | "728635",12.680881774761,-0.681218636551113,0.497535888519006,-1.36918492167242,0.170941478081073,0.260812538585099
117 | "9056",8.49718166095661,-0.677819304598832,0.560715398208492,-1.20884731677512,0.226721509350867,0.331932144409113
118 | "10598",2014.19504407581,0.671376830801339,0.10847405537726,6.18928487983942,6.04377703812324e-10,9.67004326099718e-09
119 | "400236",21.1880920520839,-0.668792765908579,0.416035525781448,-1.60753763672554,0.10793648008197,0.178537030374329
120 | "145376",1.52610582752836,0.663388347949224,0.737712673122482,0.899250307225076,0.368519349660451,NA
121 | "79696",9.65485101596621,-0.66106540118584,0.521843539707438,-1.26678851204415,0.205230920885145,0.305209888788096
122 | "10668",56.0637010937899,-0.660720344312448,0.288840443937058,-2.28749248306939,0.0221670936916351,0.0472897998754883
123 | "7051",140.268301368179,0.656900793140942,0.209462054466489,3.13613267478972,0.00171191738870911,0.00567054586670355
124 | "55030",832.784271522788,0.655091882056005,0.125664378037104,5.21302768762823,1.85783154362855e-07,1.93560123615254e-06
125 | "8747",50.8933247002735,-0.649388157655921,0.294494022406118,-2.20509792473951,0.0274472236855753,0.0561477452563367
126 | "280655",14.0014993486591,0.642737006948317,0.449238967375927,1.43072407699324,0.152509309957384,0.23641581612771
127 | "317762",1333.95111831557,0.638991793463764,0.13545977572899,4.71720693486291,2.39104461061397e-06,1.8468758371639e-05
128 | "91875",215.506706862087,0.627640208822662,0.161642032564476,3.88290223071965,0.000103217077512728,0.000491928199209595
129 | "26257",8.93400191991033,-0.620317291620155,0.535309279744913,-1.15880167800519,0.246537033382495,0.351566814757137
130 | "84684",0.777429071841636,0.619721599610084,0.659250063011809,0.940040258439814,0.347196910799642,NA
131 | "55671",625.836442042281,0.619050102819896,0.138579058003344,4.46712592609022,7.92774683923988e-06,5.45491210963185e-05
132 | "91750",72.0802359471852,0.618551207534182,0.234714480713742,2.63533466556146,0.00840543635060401,0.0215179170575463
133 | "8846",211.90687817786,0.618441499314596,0.158628478653727,3.89867887887019,9.67189383184652e-05,0.000476154773260136
134 | "80224",136.679724606725,0.617932538045701,0.203913144328084,3.0303712891185,0.00244253268868996,0.00761704448367126
135 | "80821",349.527185042906,0.617517411492699,0.149771060283887,4.12307564840771,3.73846853307419e-05,0.000214722295232979
136 | "57143",234.321216262903,-0.616132452895485,0.15772102838788,-3.90646991839442,9.36542963723545e-05,0.000466971873614538
137 | "83694",200.19809480727,0.615634577199985,0.18447082191817,3.33730055950571,0.000845964115987908,0.00308123515416734
138 | "161176",107.131534234867,0.613791305293965,0.207773831236046,2.95413191181259,0.00313549949412992,0.00942754210315575
139 | "122953",370.418506971415,0.609465408218364,0.148786535102672,4.09624034727197,4.19913975953452e-05,0.000232248717564378
140 | "25983",339.856444832978,0.607076434794237,0.147534004366222,4.11482381571707,3.87475257919354e-05,0.000216986144434838
141 | "100628307",14.6428787000668,-0.603863968450644,0.478656505942236,-1.26158103139523,0.207099585251056,0.306206647499911
142 | "56936",3.28985037391207,-0.599751783119829,0.70903649767476,-0.84586870363751,0.397625993173909,0.506229841174152
143 | "80017",95.1364434285455,-0.597380740352245,0.219089659543622,-2.72664963557216,0.0063980935334488,0.0173717933514246
144 | "6547",0.971363310048689,-0.594641566962505,0.712384317010624,-0.834720182299629,0.403875275463728,NA
145 | "100129794",21.0913652411075,-0.583254721907116,0.400852458228242,-1.45503591143007,0.145659319921442,0.228165647988833
146 | "54813",118.255348287161,-0.581245341463074,0.207417604146694,-2.80229512752445,0.00507404324876383,0.0142966753172717
147 | "79038",711.794187477751,0.579991868971388,0.145982261174355,3.97302976612117,7.09641770811084e-05,0.000374022956851018
148 | "9252",31.4082058725744,-0.578850832813967,0.347150886586917,-1.66743296698751,0.0954283431499982,0.161327915966789
149 | "652",155.215285316105,-0.575029771356739,0.186175507044211,-3.0886435089455,0.00201072555057042,0.00638868827415282
150 | "22990",4575.41546492155,-0.57376866347385,0.106085275516115,-5.4085608081085,6.35332166094378e-08,7.29817462590465e-07
151 | "11169",517.390335277691,0.57216352716212,0.14784733663199,3.86996167936595,0.000108852461787257,0.000513325293480957
152 | "26153",1508.85595650182,-0.571201924099784,0.192860958129978,-2.96172916301092,0.00305916747270283,0.00926018262007343
153 | "22938",389.279138150234,0.568098591424289,0.145440130018899,3.90606493098204,9.38113138957778e-05,0.000466971873614538
154 | "51109",2692.50482148618,-0.563820241593268,0.108550083857944,-5.19410231254288,2.05709995012853e-07,2.09450176740359e-06
155 | "9529",647.300394518713,0.562853000726294,0.125326221062931,4.49110326596112,7.0855184563955e-06,4.95986291947685e-05
156 | "5527",360.78940204713,0.562144445921013,0.147928213395034,3.8001165093493,0.000144628073814914,0.000654478556253347
157 | "317760",1.11984843894904,0.559663437011084,0.72986255237593,0.76680662021802,0.443196499682034,NA
158 | "57475",314.851584098013,0.55574832781228,0.148274744914798,3.74809835708451,0.000178180384660495,0.000767546272383669
159 | "2972",525.952338579188,-0.55393523477008,0.153952289564633,-3.59809676320223,0.000320554365092163,0.0013421341641242
160 | "8111",0.720793642787434,-0.552731884977695,0.685469876801436,-0.806354741009002,0.420038335262208,NA
161 | "56413",65.058851443831,0.552356606227336,0.260350311699041,2.12158995555899,0.0338721866052502,0.066265238424245
162 | "6710",586.584335338001,0.549730360306443,0.132562449619739,4.14695384615605,3.36927947886522e-05,0.000198610158754161
163 | "64093",11602.1575379348,-0.547994050826652,0.11969194021018,-4.57837052239582,4.68612085582908e-06,3.38610023130875e-05
164 | "10490",195.34971214517,0.544853968146428,0.166426453885026,3.27384232150277,0.00106095795658738,0.00374259184685942
165 | "91748",1332.09809198065,0.543465094520931,0.123149861689813,4.41303861054914,1.01929809532618e-05,6.81560517471833e-05
166 | "55668",275.242022255629,0.541510317767095,0.148145232211379,3.65526658998008,0.00025691486378664,0.00108966489467019
167 | "80127",79.0738051815004,0.532271715723607,0.222192063147185,2.39554783453727,0.0165955486570083,0.0371740289916985
168 | "55072",702.163628132844,-0.531879193512233,0.140659772889485,-3.78131702181919,0.000156000863185405,0.000694861454883061
169 | "283578",208.745398578219,-0.5259283183146,0.168704678005758,-3.11744952500161,0.00182423181526487,0.00592214386404828
170 | "122509",209.643535351167,-0.524701697110866,0.18475338378455,-2.84001129702039,0.00451119362081539,0.0128727053638554
171 | "283600",0.550461492546276,0.523528566766729,0.599123978763048,0.873823424406426,0.382214421749911,NA
172 | "1743",1514.19888879154,0.521718114815406,0.11122237257948,4.69076591980256,2.72184224395567e-06,2.06675478863075e-05
173 | "7453",3893.71824011719,-0.521412850943666,0.108624464573438,-4.80014196609597,1.58553194675406e-06,1.26842555740324e-05
174 | "6729",607.502675796322,0.520144647505694,0.149578226599278,3.4774088403867,0.00050628517741201,0.00195530827138431
175 | "338005",304.178617003405,0.518608606759575,0.147145242177688,3.52446738395604,0.000424335050947289,0.00168231949402111
176 | "51222",806.643348887699,-0.518000738096405,0.19059744598518,-2.71777376354078,0.00657227597902945,0.0176310158000311
177 | "1073",378.133576171255,0.515571222488357,0.170174918588575,3.02965458579023,0.00244833572689433,0.00761704448367126
178 | "145389",113.777189377681,-0.515018021904882,0.224433765175872,-2.29474393704219,0.0217478032295133,0.0467666298752613
179 | "161247",6.69726937340053,0.511963268868847,0.587668530812848,0.871176933977922,0.383657552220604,0.492488777635618
180 | "23503",1184.72909009369,0.509856846738586,0.114209722737689,4.46421578230778,8.03625444722549e-06,5.45491210963185e-05
181 | "5427",96.4388016119171,0.505927572055077,0.213973823444131,2.36443675170938,0.018057510408259,0.0398510574527096
182 | "26520",116.934542268689,0.503453042248215,0.208833542586254,2.41078629425766,0.0159181725725574,0.0360168753156853
183 | "384",306.563931996922,0.497698182536876,0.148546406906343,3.3504558804352,0.000806786605187183,0.0029626262223267
184 | "115708",501.513726289288,0.496829782751993,0.169238739486455,2.93567409128426,0.0033282379558574,0.00994033736149409
185 | "51637",396.266441921992,0.494231316267667,0.15257563473295,3.23925453190943,0.00119842573092238,0.00409843303399408
186 | "145241",6.47866414663616,-0.492916070688717,0.624615815918924,-0.789150799781699,0.430023874006447,0.536631463941193
187 | "9578",4085.72471089244,-0.489462686002605,0.125744062869859,-3.89253118462684,9.92037430051905e-05,0.000483079096373101
188 | "30001",732.457150357068,-0.486747984576763,0.142078458864643,-3.42590979988377,0.000612743830326441,0.00228757696655205
189 | "6655",180.616921806147,0.486546667773334,0.182482732945438,2.66626140413411,0.00767000286010762,0.0200945104171241
190 | "2079",948.316810885468,0.485976315344017,0.114154583698701,4.25717741327583,2.07024066338432e-05,0.000132495402456597
191 | "29986",0.380986229138297,0.485355169844338,0.560917728427151,0.865287626414136,0.38688094052525,NA
192 | "641372",0.380986229138297,0.485355169844338,0.560917728427151,0.865287626414136,0.38688094052525,NA
193 | "11183",303.832111715731,-0.484034010061898,0.165057870577578,-2.93251093309362,0.00336233083771618,0.00997565705494602
194 | "90135",343.910124938093,0.482290968033721,0.155117180368582,3.10920406680759,0.00187592103319072,0.00600294730621031
195 | "4149",394.373803558462,0.480971102165786,0.134123332850361,3.58603601583921,0.000335742609793247,0.00139271008506828
196 | "122830",180.09243729405,0.480848148255564,0.178428430117556,2.69490768897513,0.00704081340654944,0.0187755024174652
197 | "100506321",65.5654658367666,0.480323427633997,0.243033253785128,1.97636915999431,0.0481129818122861,0.089810899382934
198 | "729665",2.80467402902896,0.480065126919403,0.722675065513158,0.664289041961782,0.506505333151167,0.60998491734334
199 | "57862",411.058560782882,0.48001182869739,0.138979868696421,3.45382272410905,0.000552700563114923,0.00210849107825267
200 | "55778",117.014215438235,-0.475856371305524,0.214374664477652,-2.21974164934556,0.0264363098811795,0.0543278294805892
201 | "81693",3.68234202063747,0.47533923304985,0.692852212602537,0.686061506918408,0.492674323185692,0.598152023813523
202 | "26037",1056.20474177558,-0.474794036268472,0.117102418631082,-4.05451946952741,5.02374664275109e-05,0.000274468109262499
203 | "122622",192.428804844724,0.473014455550312,0.18099438824314,2.61342056039265,0.00896409201421164,0.0225613102380158
204 | "84837",22.9090586920735,-0.471220787974927,0.39703318770948,-1.18685490926701,0.235284855733375,0.341784998969093
205 | "10001",141.221877652282,0.470904225243499,0.18908488412972,2.49043823577373,0.0127585669344584,0.0296157408634059
206 | "79944",270.31235806466,0.469951575662422,0.149928257618952,3.13450968567126,0.00172141570953501,0.00567054586670355
207 | "51077",177.001949667295,0.468591948501353,0.176040460443726,2.661842324885,0.00777142729919607,0.0201248521967621
208 | "1396",1456.12603877695,-0.466885652297618,0.148183318873893,-3.15073016211053,0.00162862879180377,0.00544496790095588
209 | "283571",1.46861441200196,-0.462294699304538,0.735085613590269,-0.628899125159885,0.529415098841371,NA
210 | "9240",661.513926674837,-0.461015410823631,0.126154835468136,-3.65436179368703,0.000257822497399644,0.00108966489467019
211 | "55837",196.35601758628,0.458566988712574,0.168414853904732,2.72284170950844,0.00647230606453007,0.0174674284151173
212 | "394",736.627442642957,0.45791169940337,0.151878061887233,3.01499567293243,0.00256982921907283,0.00793988613892847
213 | "84932",117.733342402507,0.457445244465456,0.200658548604319,2.27971969122281,0.0226243186713669,0.0478098809659073
214 | "4860",2170.52904214006,0.457438469456008,0.109606496550986,4.17346128058406,3.00006665753648e-05,0.000181625657104911
215 | "9556",1303.36747214995,0.456100989923994,0.117373605919156,3.88589058291473,0.000101955423490575,0.000491140104556747
216 | "23256",207.911052690631,-0.455900886141015,0.205364918351304,-2.21995504295985,0.0264218195193125,0.0543278294805892
217 | "5706",58.173144324597,0.45437151094867,0.27546597101963,1.64946512001764,0.0990523832734922,0.166200253582489
218 | "4901",14.2477103398457,-0.453748095732424,0.461455892390623,-0.983296785705199,0.325461391638011,0.435426059052633
219 | "56948",417.949792978999,-0.453567892704231,0.159237559293803,-2.84837254926378,0.00439434503770809,0.0127010746896337
220 | "9692",542.69428326241,0.453464706492158,0.129965726785914,3.48910991925686,0.000484631819716128,0.00188795700202457
221 | "4140",794.24177471666,0.452621067745769,0.120691328701293,3.75023684481917,0.00017666762363128,0.000767546272383669
222 | "100128927",236.11803724886,-0.452487280292052,0.17063611834376,-2.65176730860978,0.00800717032935643,0.0206161626870786
223 | "2530",216.210938823447,-0.451305618754248,0.175162429365784,-2.57649782769229,0.00998068326332107,0.024567835725098
224 | "55449",142.102111688682,-0.450770600045154,0.188734369318723,-2.38838639550553,0.0169225393206675,0.0377178985853684
225 | "84962",577.320321451022,0.447774003821377,0.137084921208361,3.2663986664207,0.00108924784061333,0.0037537156353444
226 | "53981",1048.85142573759,0.447650389958641,0.143948679022462,3.10979157987818,0.00187219397621029,0.00600294730621031
227 | "4329",246.960164683757,0.447643191637371,0.167157558369469,2.67797158563387,0.00740695015679091,0.019519492177896
228 | "317749",82.7580300559427,-0.443834646715034,0.232039629374233,-1.9127536443321,0.0557795981137979,0.100358473714785
229 | "112399",294.78949478351,-0.443731855564627,0.155768787349169,-2.84865705842573,0.00439041772330522,0.0127010746896337
230 | "51635",802.41807128455,-0.442329115856746,0.128117489223321,-3.45252719623412,0.000555361489361194,0.00210849107825267
231 | "81542",601.231714145266,-0.441494806109607,0.127962298902718,-3.45019439237528,0.000560183021030795,0.00210892431446887
232 | "57062",2348.41650548263,0.440373505453721,0.110214277000663,3.99561216058308,6.45272980965176e-05,0.000348291922255902
233 | "8487",57.715584578397,0.439871654347625,0.272226656660104,1.61582873530581,0.106131346932452,0.176754064779698
234 | "79789",21.9986173193528,0.439062473021326,0.386454643552366,1.1361293759738,0.255902422323076,0.360516620128108
235 | "11161",715.105508913183,0.438492219432314,0.12412329240078,3.53271502029185,0.00041131559174112,0.00164526236696448
236 | "4287",70.2822404296889,0.438267727378813,0.240794327519911,1.82009157729255,0.0687450600642271,0.119835746726746
237 | "3958",846.778110957202,-0.437483576782368,0.123386351975692,-3.54563993324445,0.00039166097558298,0.00158075781136194
238 | "1690",0.242997164174289,-0.437250728136416,0.495337480560693,-0.882732975589639,0.377380561070803,NA
239 | "9766",896.296524424026,0.434938646371191,0.131684570370554,3.3028823737458,0.000956965201150858,0.0034249628742119
240 | "145216",0.253504571050542,0.432632259412664,0.496857294270321,0.870737462047373,0.383897518706941,NA
241 | "1113",0.371979067831239,0.431595071618149,0.579135247661963,0.745240551944556,0.456126312029222,NA
242 | "4707",178.791345472213,0.428691703319275,0.172375374568477,2.48696604368495,0.0128837695670494,0.0297522101342171
243 | "10038",444.471305216434,0.423792413470318,0.133689117766327,3.16998436784553,0.0015244713883889,0.00513506151878366
244 | "7127",2015.20487712645,-0.420473749742107,0.137867534887832,-3.04983874618634,0.00228964264157115,0.00722366129171743
245 | "9321",350.524805417528,-0.419944221263059,0.162687972847247,-2.5812862125792,0.0098432936930397,0.0243635114612253
246 | "51218",227.670681076639,0.419476863683742,0.162227518206043,2.58573186794962,0.0097172490168919,0.0241851531087087
247 | "145200",0.247775729041724,-0.419412921727983,0.486431191550976,-0.862224563335862,0.388563955265745,NA
248 | "84502",0.247775729041724,-0.419412921727983,0.486431191550976,-0.862224563335862,0.388563955265745,NA
249 | "51199",708.409056056689,0.418832165333009,0.129415165536475,3.23634532009283,0.00121070822841957,0.00410907035099974
250 | "328",2347.23082623196,0.418059812456214,0.105694020476213,3.95537808641027,7.64137078935848e-05,0.000393486679727885
251 | "441687",0.267502439490617,0.4178399740206,0.492348268582714,0.84866749957993,0.396066335680888,NA
252 | "196913",0.258932569471319,-0.415152415296038,0.484339485121617,-0.857151704639143,0.391361049284835,NA
253 | "645687",0.240905502892244,-0.411527229678687,0.482781165174266,-0.852409454561346,0.393986874458091,NA
254 | "26499",44.3431365180018,-0.411224791104946,0.283974626048082,-1.44810399727516,0.147587967817735,0.230381218056952
255 | "58517",851.743223471815,0.406796577377911,0.123034818136252,3.30635330339916,0.000945188279238054,0.00341487378305361
256 | "55051",258.307878536129,-0.4016385902835,0.158180570373089,-2.53911456594312,0.0111133424365118,0.0266244781366699
257 | "55012",74.2766655347405,-0.401012343044333,0.241018159165683,-1.66382626285294,0.0961471540034516,0.161932048847919
258 | "100506499",0.628547888911741,-0.398354689274887,0.66168332650011,-0.602032231009861,0.547152683076694,NA
259 | "23508",2.46558891104577,0.397856173134218,0.714924422334326,0.556501024031552,0.577868394591393,0.679488296002478
260 | "57544",160.740297547376,-0.394305770640654,0.197452292879132,-1.99696729215509,0.0458287362025286,0.0862658563812303
261 | "801",2942.34974033974,-0.393921385991628,0.104204373288416,-3.78027690739365,0.000156654033355333,0.000694861454883061
262 | "122961",81.8699095689892,-0.389208641659444,0.229016890169391,-1.69947570841421,0.0892295877544899,0.152575783641265
263 | "624",361.641418947578,0.383154578414905,0.15064475956582,2.5434311788821,0.0109769678638395,0.0264391484032264
264 | "122553",133.794357172539,-0.380699291126736,0.200769657218314,-1.89619933809404,0.0579336906853524,0.103041443496957
265 | "100129075",31.4833332449977,0.380124028092704,0.320589916708008,1.18570175879524,0.235740099735379,0.341784998969093
266 | "57596",821.832052695162,-0.379436763106851,0.178497308526073,-2.12572820419546,0.03352589709822,0.066265238424245
267 | "9112",2186.05010257752,0.377169088027411,0.149855466850467,2.51688574300574,0.0118397217109953,0.0277706561598213
268 | "5411",1519.65466877779,0.374374091520246,0.114574533283709,3.26751574534652,0.00108495831171467,0.0037537156353444
269 | "63874",497.8177544446,-0.370941277089145,0.138362432597431,-2.68093925587741,0.00734158412491473,0.0194617141299515
270 | "8906",914.253047671338,-0.370846307691758,0.140860321136195,-2.63272371311148,0.00847032195623239,0.0215608195249552
271 | "64112",285.594077916298,-0.370613192649338,0.155687950908822,-2.38048731764981,0.0172897561803309,0.0383455978652884
272 | "91833",191.915754857527,0.370061215561318,0.165262053671827,2.23923887752342,0.025140377488638,0.0521430051616196
273 | "7517",453.254617345418,0.368297140173018,0.162984740298552,2.25970320594663,0.0238396772018087,0.0496752343553968
274 | "5684",519.235637448021,0.367767182397068,0.139799037121658,2.63068465970208,0.00852130622735119,0.0215680519200753
275 | "283551",1.37277277298017,-0.365893278777356,0.734541679163339,-0.498124598176808,0.618396223912423,NA
276 | "100616300",0.126022912962788,0.365385981793238,0.374900854954253,0.974620294845219,0.32974864280122,NA
277 | "122970",0.126022912962788,0.365385981793238,0.374900854954253,0.974620294845219,0.32974864280122,NA
278 | "319103",0.126022912962788,0.365385981793238,0.374900854954253,0.974620294845219,0.32974864280122,NA
279 | "56967",0.126022912962788,0.365385981793238,0.374900854954253,0.974620294845219,0.32974864280122,NA
280 | "5283",143.858465886597,-0.364913740313636,0.191735763622523,-1.90321165660077,0.0570129327512028,0.102167175490155
281 | "5018",2809.65296920578,0.364190555633631,0.103707285785242,3.51171620080578,0.000445223161699617,0.00174964891615288
282 | "55015",259.471209211921,0.36184943005149,0.165859476781818,2.18166267657705,0.0291344374366644,0.0590598550752293
283 | "85439",850.777585128837,0.361407068016596,0.120895552075515,2.98941575444272,0.0027951151539126,0.00857679170515649
284 | "100505967",0.398320434717103,-0.357703243833274,0.571224465918088,-0.626204347284676,0.531180916159589,NA
285 | "440163",0.398320434717103,-0.357703243833274,0.571224465918088,-0.626204347284676,0.531180916159589,NA
286 | "84312",82.0916750284388,0.357363755112942,0.236035657818517,1.51402444196678,0.130019640454546,0.206556024551903
287 | "55017",545.214836837902,-0.357320488909194,0.130657945464639,-2.7347781081243,0.00624223393301568,0.0170519561097014
288 | "23034",381.58724152163,0.352678030132048,0.136341376974607,2.58672780015806,0.00968921003656277,0.0241851531087087
289 | "5720",1005.81015687042,0.35258238362895,0.122566085872568,2.87667164304758,0.00401893556258595,0.0118452837634112
290 | "10972",5587.36073511244,-0.34906005676877,0.1037717420592,-3.3637293721988,0.000768969072877389,0.00284709210453777
291 | "122618",0.118581531414837,0.348927457602054,0.36696198936965,0.95085449640554,0.341678243750561,NA
292 | "83851",0.118581531414837,0.348927457602054,0.36696198936965,0.95085449640554,0.341678243750561,NA
293 | "9362",0.118581531414837,0.348927457602054,0.36696198936965,0.95085449640554,0.341678243750561,NA
294 | "641371",332.588597786726,-0.348253466983179,0.144948807807925,-2.40259628381806,0.0162791484056602,0.0366485351042
295 | "29091",2.50448591864301,-0.348138741823103,0.716235127192141,-0.486067673318275,0.626919185849236,0.716479069541984
296 | "10548",2026.383277715,-0.347923198972176,0.124589736759815,-2.7925510400821,0.00522942168400646,0.0146423807152181
297 | "57578",3.44740527247424,-0.347192478872561,0.685031311886428,-0.506827166653839,0.612276092973096,0.706431766653722
298 | "3895",4077.9144398207,0.345812909039355,0.135475025656044,2.55259526517704,0.0106923664548349,0.0258928657933298
299 | "5529",261.171892485766,0.342789276220602,0.159939300423096,2.14324606468706,0.0320933451804049,0.0639014161814284
300 | "53349",205.457665267819,-0.342319108248156,0.169230014714872,-2.02280374923393,0.0430933849895344,0.0821524956396231
301 | "10858",0.387163594287509,-0.34128453497435,0.567808795271691,-0.60105538663072,0.547803096697344,NA
302 | "283638",1329.67069021499,0.337540045001016,0.158968432424355,2.12331492393393,0.0337274709385027,0.066265238424245
303 | "122481",1.84785934512185,-0.336254666712809,0.734586260129398,-0.457747013473377,0.647134207387507,NA
304 | "10419",2139.13743449535,0.333995498493685,0.106863809352388,3.12543133655585,0.00177544513393606,0.00580583518250626
305 | "112840",133.424955079484,-0.333279843731912,0.210155700676107,-1.5858710596938,0.112768561983092,0.185736455030975
306 | "10979",1605.59606939013,0.33260514195071,0.120750559222638,2.75448117252574,0.00587852720824391,0.0163576409272874
307 | "145508",73.9463761913324,-0.330791586181098,0.241761997337215,-1.36825303324948,0.171232886547883,0.260812538585099
308 | "55640",333.988970315349,0.328472278909357,0.144203729318107,2.27783484146074,0.0227364196150309,0.0478212018194076
309 | "26175",104.274319462524,0.326658907976218,0.218729977709926,1.49343455979969,0.135323472130859,0.213468012375439
310 | "51292",641.458908619578,-0.319241216510753,0.127061010236762,-2.51250337074991,0.0119877959877635,0.0279715239714483
311 | "51528",805.300375733519,-0.318661522541899,0.150871910443338,-2.11213287884744,0.034675053618952,0.0675409740056108
312 | "63894",262.673223524986,-0.315745806320558,0.16050965795087,-1.96714521949329,0.0491664713356335,0.0910189221419992
313 | "5875",502.022580865566,-0.31571868400896,0.160195921280548,-1.9708284798091,0.0487434979491961,0.0906103198391695
314 | "27004",5.5987695682621,0.312363207046138,0.618259288572245,0.505230107852455,0.613397225956022,0.706431766653722
315 | "6430",1127.53477989518,-0.309569783284862,0.120543231330579,-2.56812248906697,0.0102251018017035,0.0248959000389302
316 | "2764",278.162579539551,0.30917254063676,0.198260364168828,1.55942687754516,0.118895378943456,0.191601186211037
317 | "7187",812.893970465571,0.307223132905796,0.131403639215893,2.33801084002731,0.0193866858219949,0.042574682589479
318 | "2954",379.351814506592,-0.305686348826925,0.151436965313583,-2.01857154357217,0.0435317711800459,0.0826365825790703
319 | "83544",77.6975608237549,-0.303063963193237,0.256418921745299,-1.1819095140501,0.237241606509902,0.342852386182052
320 | "399671",15.1962776697668,0.301969615865272,0.445615493763338,0.67764613235294,0.497996088589293,0.602979047805414
321 | "2643",128.759295021216,0.295731941160414,0.187774557985559,1.57493083372433,0.115272410593056,0.187789236166143
322 | "1735",8.12959877458449,-0.29544925952192,0.553345334742433,-0.533932864292501,0.593387983518991,0.691253103856079
323 | "23357",619.09186752831,0.295440013159353,0.154281311564228,1.91494361931425,0.0554996942261681,0.100257512150497
324 | "2877",1.21412409334817,0.292986260182569,0.732772124574733,0.399832704270246,0.689279740945444,NA
325 | "2100",5.81005327996789,0.292346558159733,0.606847345452377,0.481746456255488,0.629986070019345,0.718152059462255
326 | "23224",4152.34332616615,0.290004963185434,0.127070630367674,2.28223439473242,0.0224755054746285,0.0477205045148511
327 | "64398",799.083464528065,-0.289890574513772,0.126394894150815,-2.29353073525163,0.0218174679552,0.0467666298752613
328 | "51550",208.776588999014,0.28888029088555,0.160436641726452,1.80058799397022,0.0717678429125814,0.124620130328824
329 | "64582",31.9129063238833,-0.287166594415887,0.382217158993499,-0.751317903079207,0.452461357233016,0.561503290970613
330 | "122945",2.83728327830191,0.287068809413578,0.727491884624262,0.394600703431687,0.693137594312509,0.766729980869146
331 | "27109",70.0454353579069,-0.28609436091265,0.243339558296034,-1.17570017351886,0.239714709266342,0.345312507238975
332 | "4323",4.79040587745618,0.279705497887444,0.635834394437104,0.43990306333627,0.660007317113008,0.737364783208547
333 | "55172",203.4891713324,0.279560653141986,0.16647089473391,1.67933652059023,0.0930864772545196,0.158565558213022
334 | "207",2032.32652122213,0.279356468078045,0.139799978705281,1.99825830207721,0.045688663654604,0.0862658563812303
335 | "22846",55.0025234757467,-0.279130114378276,0.270401218874486,-1.03228127276986,0.301940371426092,0.412568082775539
336 | "51004",349.265901130191,-0.277673769261555,0.144347366198355,-1.92364971093403,0.0543985055122671,0.0990671970304702
337 | "51527",156.900063151762,-0.277182928274608,0.185963005245897,-1.49052725787095,0.136085654853771,0.213917099559611
338 | "58533",399.086536707901,0.276385137450703,0.139580564328385,1.98011190727432,0.0476909554715122,0.0893955985407426
339 | "11177",842.456675402816,0.273282413946731,0.14194863713897,1.92522041391058,0.0542017892643689,0.0990671970304702
340 | "283",22.4869525319274,-0.270541506663342,0.393627131202274,-0.687304012396233,0.491891170894737,0.598152023813523
341 | "3831",973.30817353131,-0.270311180168495,0.119506312742922,-2.26189875634418,0.0237036592358573,0.0496226137274022
342 | "122809",324.376854511496,-0.268958086546331,0.153080576220792,-1.75697069599742,0.0789228041838935,0.135990062593786
343 | "379025",62.666364319544,0.268711542563852,0.257428957276012,1.04382795706911,0.296564994439993,0.407549440212015
344 | "29082",42.1993161033554,-0.26632031596055,0.298656812288683,-0.891726908620198,0.372539326875431,0.483761212870125
345 | "5890",45.4322734370428,-0.265186486781387,0.313591593972598,-0.845642842086384,0.397752018065405,0.506229841174152
346 | "10640",291.45504854957,0.260541830799317,0.180561746257476,1.44295143461778,0.149034152533451,0.231830903940923
347 | "5721",76.7400952701813,0.259818561989378,0.24240349807465,1.07184328631002,0.283790427063313,0.392401578161618
348 | "8788",0.623350477155508,-0.25979204720537,0.633996707781716,-0.409768763806918,0.681975581974887,NA
349 | "1734",941.981114108177,-0.257092167943856,0.135597550060593,-1.89599419627399,0.057960811967038,0.103041443496957
350 | "29954",1514.34557881554,0.255442136886182,0.12021975779262,2.12479330832475,0.0336038634455599,0.066265238424245
351 | "27141",427.826922792813,0.255363576012404,0.133314057799741,1.91550373776786,0.0554282929245971,0.100257512150497
352 | "9878",815.01466464375,0.253951670707181,0.122166819533647,2.07872867343688,0.0376422959611364,0.0730032406519008
353 | "171546",1059.62168396793,-0.252580377898037,0.116393766410897,-2.17005073112222,0.0300030029864114,0.0605466006212267
354 | "55147",1571.99344008633,0.250910940915595,0.108028085838491,2.32264543954543,0.0201982075398188,0.0439261989215476
355 | "7597",245.070544900869,-0.250753250007562,0.164269483274881,-1.52647494232366,0.126891626248814,0.203026601998102
356 | "647310",7.92097100229312,-0.250572434077565,0.550880221881313,-0.454858287018971,0.64921119389667,0.730096602147412
357 | "145567",66.3122284623466,0.250375910020869,0.255056549195962,0.981648621884644,0.32627298501524,0.435426059052633
358 | "9950",210.514321001507,0.246503823030538,0.176857704693947,1.39379747948846,0.163378769513036,0.250663317609042
359 | "5700",241.095835856152,0.243650082656849,0.178284953130114,1.36663290075316,0.171740399291527,0.260812538585099
360 | "91612",176.322010522042,-0.24275085805165,0.193300946194296,-1.2558182607531,0.209181869379141,0.307884745986296
361 | "55218",647.67921470333,-0.242427571634415,0.132697857995435,-1.82691397808965,0.0677127042389817,0.118961927447309
362 | "161253",5.8779934560187,0.240548437603009,0.600800261741193,0.400380047947833,0.688876617439522,0.763902783695312
363 | "10202",53.4125358228483,-0.23839642393,0.288982791731122,-0.824950241853193,0.409399840621614,0.518110532764077
364 | "10577",1183.45168902212,-0.23645231732306,0.120543110437632,-1.96155812194177,0.0498139519358838,0.0918380677665676
365 | "1778",33118.8415116603,-0.236336348975265,0.116020198428026,-2.03702762258142,0.0416472739933071,0.0799192316746729
366 | "3091",2789.35036249145,-0.235855338348046,0.129326292024166,-1.82372303927166,0.0681939539380789,0.119339419391638
367 | "7043",157.407024331088,0.235153313352927,0.187426740476571,1.25464121477544,0.209609034655849,0.307884745986296
368 | "114088",0.860660299765207,-0.233822299698246,0.712602530870741,-0.328124430617071,0.742817582715481,NA
369 | "5891",81.8415374666515,-0.233618513548814,0.229532060911059,-1.01780340672904,0.308771392728889,0.419180557401643
370 | "123016",226.775921080972,-0.232073414175784,0.166391556797132,-1.39474272999761,0.163093434463717,0.250663317609042
371 | "599",605.273181172259,0.231068746765875,0.131287485197002,1.7600211202092,0.0784042256525486,0.135618120047652
372 | "9147",441.831968171799,0.229901340929063,0.145538753777707,1.57965720443237,0.114185391213221,0.18669728198366
373 | "145438",1.87102942734362,-0.22894723518139,0.734181889274388,-0.311839938475825,0.75516216990453,NA
374 | "55148",504.053772523441,0.22797591878988,0.131062771618923,1.73944069680398,0.081957275001787,0.140677621459006
375 | "4738",1068.17611216943,0.225504345876595,0.116415662600374,1.93706191108232,0.0527377758617423,0.0968300146969694
376 | "100506412",0.129466284735659,-0.224564553683371,0.374514863528414,-0.599614529494724,0.548763161715977,NA
377 | "84659",0.129466284735659,-0.224564553683371,0.374514863528414,-0.599614529494724,0.548763161715977,NA
378 | "5265",0.127374623453615,-0.221913848739733,0.372544065187931,-0.595671410381447,0.551394760947956,NA
379 | "84439",0.127374623453615,-0.221913848739733,0.372544065187931,-0.595671410381447,0.551394760947956,NA
380 | "57447",155.170394903188,0.215671408310248,0.189230816697441,1.13972666859586,0.254400192988979,0.359530872110608
381 | "161145",1.64746279960457,0.210545927663244,0.737203146861455,0.285600961634002,0.775183766845748,NA
382 | "100616158",0.118309444306065,-0.209906736415662,0.363393776681876,-0.577628869520843,0.563514711598653,NA
383 | "100886964",0.118309444306065,-0.209906736415662,0.363393776681876,-0.577628869520843,0.563514711598653,NA
384 | "145407",0.118309444306065,-0.209906736415662,0.363393776681876,-0.577628869520843,0.563514711598653,NA
385 | "145474",0.118309444306065,-0.209906736415662,0.363393776681876,-0.577628869520843,0.563514711598653,NA
386 | "27133",0.118309444306065,-0.209906736415662,0.363393776681876,-0.577628869520843,0.563514711598653,NA
387 | "338917",0.118309444306065,-0.209906736415662,0.363393776681876,-0.577628869520843,0.563514711598653,NA
388 | "253959",201.648717330029,0.208030919542369,0.179772739363181,1.15718834946438,0.247195416626112,0.351566814757137
389 | "90141",24.4837270753047,-0.207536622482438,0.380438360033124,-0.545519706436459,0.585396130838795,0.686537870721937
390 | "78990",46.8164519399536,-0.205515825650427,0.285836954780944,-0.718996694489441,0.472142956361349,0.577400352774452
391 | "100506071",0.113530879438629,-0.203330446356014,0.358231568779085,-0.567594997417451,0.57031000478715,NA
392 | "382",737.83712827266,-0.20310917654544,0.12486006866273,-1.62669441656383,0.10380201518375,0.173519786575822
393 | "115817",616.477938378626,0.202837556612749,0.150776665213746,1.34528480468247,0.178533278750041,0.270212530000062
394 | "22985",3016.39039310875,0.199127317799384,0.107598175407943,1.85065701202108,0.0642189121468099,0.113268002526657
395 | "6495",66.5246849241116,-0.196945138529458,0.254656990623484,-0.773374169102023,0.439300972596241,0.546685654786434
396 | "254170",204.062390446734,-0.192642813711905,0.165360464660124,-1.16498713345936,0.244024201035843,0.350393724564288
397 | "4331",119.188578115546,0.191992191002935,0.220311920162178,0.871456210184194,0.383505105731869,0.492488777635618
398 | "1983",2444.68667469688,0.190505368999447,0.123632788380176,1.54089680816415,0.12334190279656,0.198054381551466
399 | "9517",1554.26535349477,0.179035005224858,0.118015794375216,1.51704274985125,0.129255891035637,0.206073449053258
400 | "91754",1648.87606616573,-0.178101601751243,0.110811206153334,-1.60725262303161,0.107998962570186,0.178537030374329
401 | "112752",185.146903673386,-0.1733811719961,0.17673074243173,-0.981047041451073,0.326569544289475,0.435426059052633
402 | "9786",254.146157455785,-0.172972204386026,0.170607786585387,-1.0138587918404,0.31065013695031,0.420456982941809
403 | "57680",3836.28560711167,-0.172701650914912,0.103386594154854,-1.67044530605425,0.094831292138451,0.160925829083432
404 | "85457",282.509170182675,0.168113338658457,0.148914126638365,1.12892807723149,0.25892817533593,0.362499445470301
405 | "400242",11.4799255027677,-0.167462340834447,0.484401159197771,-0.345710033212525,0.729560643272612,0.791387816431308
406 | "9895",739.002073124613,-0.167068533641863,0.138592073875687,-1.20546961287063,0.228022061943711,0.332748807005806
407 | "112487",50.9102726583708,0.16658027277362,0.298903810055551,0.557303945850208,0.577319780212899,0.679488296002478
408 | "22795",4.82374320650146,-0.165852479361822,0.640673377194918,-0.258872126211923,0.79573390531283,0.850808567017059
409 | "57697",164.260380319255,0.165150594646287,0.184429090933219,0.895469330844818,0.370536253306437,0.48255884151536
410 | "8650",1103.36769567791,0.163975970335274,0.115824861222939,1.41572343453669,0.156856494325419,0.242316239509614
411 | "27030",232.701123222174,0.16274404274992,0.162193785705202,1.00339259018047,0.31567147785238,0.424687153386986
412 | "26094",119.974070654714,0.161823078151544,0.192622976148723,0.840102678231912,0.400850818589996,0.508728517643961
413 | "25831",2867.30747347967,-0.161292392360617,0.107289157868018,-1.50334288725644,0.132750646098509,0.210149432693045
414 | "79711",2864.04512166258,0.158924993121608,0.119251686517127,1.33268549706241,0.182635034591888,0.275361050363322
415 | "400224",1.73273859901924,-0.157850265912551,0.735174016495475,-0.214711432083811,0.829992308934793,NA
416 | "161159",0.270945811263489,-0.156605510544756,0.501182926581317,-0.312471758790743,0.754682022553883,NA
417 | "10175",1314.19330376357,-0.15587643952223,0.117270045522131,-1.32920933754403,0.18377891530945,0.275361050363322
418 | "3705",1366.66868550646,-0.154821736973523,0.153659147276219,-1.00756602986489,0.313662843842326,0.423255885666753
419 | "4247",534.462748965834,-0.153984635171658,0.134091927752396,-1.1483512673186,0.250823582856971,0.355597990885833
420 | "11198",1990.51963834867,0.151330038893955,0.113729947102751,1.33060854022233,0.183317850943794,0.275361050363322
421 | "87",8204.7123594792,-0.15001994466855,0.135274822627486,-1.10900122990121,0.267429651411738,0.372076036746766
422 | "677811",1.71507972613353,0.149731692304368,0.736963608364107,0.203173793936364,0.838999195050402,NA
423 | "10901",212.76436235429,-0.147289830847493,0.163451780215775,-0.901120995152534,0.367523991100263,0.481434935710286
424 | "64745",985.305811169643,0.144264717482314,0.124263405274463,1.16095899000735,0.245658577399199,0.351566814757137
425 | "6617",67.4939808027702,-0.143868420837191,0.269930810323029,-0.532982584185266,0.594045636126318,0.691253103856079
426 | "84193",500.130667603109,0.1425945469339,0.1381691972525,1.03202848224783,0.302058774889234,0.412568082775539
427 | "8106",91.419711245285,-0.140177510122818,0.219012951480193,-0.640042103334219,0.522145227344225,0.622130483644183
428 | "51339",0.255010405966459,-0.139881212499408,0.493211191687888,-0.283613216522318,0.776706807527367,NA
429 | "90809",772.281353977415,0.13892860766134,0.12501173555128,1.11132452524308,0.266428686470379,0.37183816678732
430 | "647286",0.259788970833894,-0.13743147217625,0.485834713889095,-0.282877011970006,0.777271116956159,NA
431 | "57594",275.328814476861,-0.13602730135369,0.150334561636116,-0.904830531803746,0.365555174003825,0.480260169952239
432 | "8816",708.727474955544,-0.135518750562175,0.124687489676447,-1.0868672624161,0.27709547628496,0.384330567726508
433 | "23588",516.196297186756,0.133554514156424,0.129812165917459,1.02882894844651,0.303560060953739,0.413358380873177
434 | "11154",43.9827922862683,-0.133234621565126,0.295313616606317,-0.451163150200219,0.651871966203047,0.730096602147412
435 | "256281",148.196016291302,0.130945380058728,0.197903335195332,0.661663331390977,0.508187009844328,0.610369384477906
436 | "55860",192.696379181633,-0.130847310829271,0.182340320607231,-0.717599433814325,0.473004306848714,0.577400352774452
437 | "100506433",1.35528444297627,0.129427338702992,0.735439696427098,0.175986337604257,0.860304675940042,NA
438 | "3429",7.64892592330244,-0.128937877001941,0.546872294609916,-0.235773284316611,0.813608599811959,0.863736143876203
439 | "5494",364.538884054995,0.127662058248495,0.142256272759981,0.897408991334183,0.369500716460388,0.48255884151536
440 | "101101776",25.4903279260031,0.124078231372693,0.355962406069729,0.348571167227101,0.727411276866134,0.79097148552434
441 | "414763",25.4903279260031,0.124078231372693,0.355962406069729,0.348571167227101,0.727411276866134,0.79097148552434
442 | "100302145",0.236890975720902,0.12271407934208,0.480712422765777,0.255275448543736,0.798510361126856,NA
443 | "64841",488.032864647589,-0.121354941704192,0.147578881608759,-0.822305606203955,0.410902986824473,0.518547994640462
444 | "26030",1248.88539383753,0.120961049614295,0.14121920710095,0.856548143113605,0.391694655309712,0.501369158796432
445 | "643382",31.8368684211581,0.119459149467213,0.331175183490832,0.360712865644174,0.718314105958838,0.78873705752343
446 | "283635",168.526110583332,-0.117308978464313,0.184328369156869,-0.636413043748459,0.524507241360299,0.623287119706668
447 | "64430",587.538388638622,-0.116951111806399,0.144336330191908,-0.81026801534237,0.417786153917509,0.52575336223327
448 | "5641",1222.23822512151,-0.115310790280814,0.124593371779799,-0.925496987790081,0.354707449831108,0.467379228012754
449 | "54207",0.245956154868452,0.11506194435816,0.485961104184167,0.236771921389318,0.812833735712259,NA
450 | "23093",538.97983645015,-0.114542255434352,0.131452999725605,-0.871355204319774,0.383560236792584,0.492488777635618
451 | "5693",3002.29692720442,-0.114301569110141,0.119935329756914,-0.953026679809936,0.340576553021612,0.450083468299948
452 | "6554",0.890852916646125,0.113774306029615,0.695949159940928,0.163480772128918,0.870139911102289,NA
453 | "51382",290.267324962901,0.113495478930463,0.156515258131433,0.725140029703401,0.468366129040498,0.575877733346636
454 | "57096",2.23860210229564,-0.112831325221111,0.730430646751621,-0.154472331798913,0.877237325381973,NA
455 | "93487",1201.03950847312,-0.108016208433205,0.112029522394436,-0.964176282506131,0.334957553133846,0.443967407704033
456 | "374569",6.74847143752185,-0.10727236010835,0.608563294812328,-0.176271492255265,0.860080657221128,0.898172807540945
457 | "6166",1274.39217432313,-0.107089685529642,0.110416999409593,-0.969865927368589,0.33211332568701,0.441503768272345
458 | "677",934.009599185288,0.10660317002246,0.145300668846361,0.733672947749338,0.46314808978441,0.573177746473524
459 | "6815",167.426508034665,0.105741759949831,0.176701625932274,0.598419846970506,0.549559826164749,0.651330164343406
460 | "84520",102.424327106805,0.103339794104056,0.207972527595384,0.496891562067786,0.619265522132918,0.711361420296275
461 | "23186",817.135573336417,-0.103112104476238,0.130159500999712,-0.792198062256449,0.428245198281421,0.535904605670604
462 | "4792",973.472293334456,-0.102144364077829,0.115892039188458,-0.881375155645741,0.378114800118106,0.48958216893905
463 | "2009",4.33486237140333,-0.098209686897393,0.652282890018474,-0.15056302779092,0.880320428387475,0.912924888698122
464 | "55632",103.96944592161,-0.0980592374483497,0.242409748993198,-0.404518538778328,0.685831448075064,0.762413123418433
465 | "2957",438.054205665003,-0.0979111949084367,0.14563143494875,-0.672321844132714,0.501378826272977,0.605438582669256
466 | "23405",704.60037056881,0.095921830167532,0.132524600403796,0.723803956965444,0.469186099713219,0.575877733346636
467 | "3169",370.470876131861,-0.0958226705228631,0.147952284273907,-0.64765928416127,0.517205344055783,0.617887984365308
468 | "54916",195.412037357972,-0.095189748309517,0.182867874976168,-0.52053838500569,0.602688381347239,0.69949325089006
469 | "51466",652.693608509333,0.0911461562724757,0.138897441879216,0.656211914627887,0.511687783829496,0.612930821271696
470 | "283643",382.099819729571,0.0892913587298993,0.172466254403725,0.517732347343021,0.604645025907804,0.699950831025055
471 | "55334",1810.10842247537,0.0875212722139334,0.109385809636977,0.800115412633443,0.423643932123837,0.531631601096579
472 | "64403",1334.11510565657,0.0867598515087949,0.147326457340707,0.588895253947185,0.555931541704819,0.657143352727596
473 | "56252",1279.79272180009,-0.0821661243693046,0.1127163385,-0.728963746185784,0.466023839265818,0.575147878763323
474 | "1152",13757.2273555563,-0.0791357679730935,0.160172181395234,-0.494066867815338,0.621258955973696,0.711826118353493
475 | "406986",0.141479526527829,0.0778722612282298,0.38108842408569,0.204341712596129,0.838086475643729,NA
476 | "442910",0.141479526527829,0.0778722612282298,0.38108842408569,0.204341712596129,0.838086475643729,NA
477 | "6252",0.141479526527829,0.0778722612282298,0.38108842408569,0.204341712596129,0.838086475643729,NA
478 | "9623",0.141479526527829,0.0778722612282298,0.38108842408569,0.204341712596129,0.838086475643729,NA
479 | "112849",157.715515478218,0.0763780926494963,0.217529233364138,0.351116452112169,0.725500985749109,0.79097148552434
480 | "6235",1039.37062427954,0.0762277090563349,0.355844408570334,0.214216402507469,0.830378305612586,0.875316425681032
481 | "3714",4388.0410852243,0.0759439589154301,0.179796891416996,0.422387496896684,0.672742202525814,0.749722653561106
482 | "122773",0.728507111444157,-0.0754987972277864,0.692294548961066,-0.109055888625858,0.913158161916667,NA
483 | "23116",137.148299137262,-0.0751511307614764,0.192096833698781,-0.391214833240394,0.695638444958267,0.767601042712571
484 | "10914",1238.80820484706,-0.0736605166307863,0.136560120851599,-0.539399908051003,0.589610944480027,0.689675465083687
485 | "957",271.04290877112,0.0725375280553122,0.151677291025493,0.478235914980316,0.632482302826034,0.719167694583917
486 | "145501",0.127481658087755,0.0723380870306647,0.368577811677791,0.196262728625407,0.844404532273136,NA
487 | "2003",0.127481658087755,0.0723380870306647,0.368577811677791,0.196262728625407,0.844404532273136,NA
488 | "23368",297.253914660692,-0.0708978242166742,0.150943147620043,-0.469698859037573,0.638570183144849,0.724251752022512
489 | "145226",1.59843573157565,-0.0683420333574104,0.736804916536776,-0.0927545837759069,0.926098532294316,NA
490 | "8110",2.7714974915651,-0.0681216907980659,0.713646181846261,-0.0954558330597797,0.923952770121433,0.949382662877068
491 | "22890",514.116693111494,-0.0642310628035167,0.140249768018329,-0.457976249879591,0.646969504392549,0.730081455838443
492 | "54930",357.222340031938,0.0638358078625714,0.138284506431031,0.461626609589913,0.644349108931689,0.728960608084335
493 | "89874",33.6059994970774,-0.061923159058826,0.321684681112035,-0.192496449768025,0.847353348585654,0.889026464089866
494 | "6400",1767.58151093855,-0.0512132728996478,0.113508868582698,-0.451183009214262,0.651857654349898,0.730096602147412
495 | "145282",41.2473834574326,-0.050519464428872,0.303996621789149,-0.1661842955081,0.868011917964965,0.904347300577452
496 | "5926",138.622912226507,0.0471469099063212,0.192390279169804,0.245058690645743,0.806410987880845,0.860171720406235
497 | "29966",411.282379091423,0.0447827179887914,0.142959049194105,0.3132555668301,0.754086504961168,0.816016314547351
498 | "5663",1245.51865050415,0.042911138441329,0.113021009751444,0.379673996327757,0.704187425192626,0.775125224782055
499 | "26277",274.991297105689,0.0424405234986663,0.150748435045623,0.281532100056772,0.778302310994315,0.838171719532339
500 | "4522",1684.45118501255,-0.039537485091443,0.112956208340768,-0.350024896127584,0.726320013667064,0.79097148552434
501 | "4293",408.170124757446,-0.0392248853435439,0.14149146499447,-0.277224391909979,0.78160781130757,0.839712948359211
502 | "283579",1.19047521939441,-0.0378498392470862,0.728995031513934,-0.0519205723096383,0.958591982049985,NA
503 | "23241",695.422830300979,-0.0376696410013677,0.175133961563007,-0.21509044085556,0.829696806586202,0.875316425681032
504 | "9985",33.9823818516358,-0.0362674865377648,0.346546951723306,-0.104653889920007,0.916650451473542,0.946219820875914
505 | "5836",1031.59281056639,0.0354564450877206,0.11949087944086,0.296729300626405,0.766673185315384,0.827637559087451
506 | "51241",98.7652859054145,0.0346113753774838,0.227168136897131,0.152360167452344,0.878902872004628,0.912924888698122
507 | "23002",148.616912758957,-0.0343398316562254,0.190819962156977,-0.179959325366472,0.857184500463533,0.897239850952483
508 | "8892",683.894155651366,0.0329723115337303,0.124520274261662,0.264794723021921,0.791167602258127,0.847949966056558
509 | "161424",514.721182351494,-0.0326572494410727,0.150158699586574,-0.217484897851317,0.827830475117279,0.875316425681032
510 | "80344",941.694825253544,-0.0288253345735764,0.121909360417866,-0.236448903306296,0.813084352418668,0.863736143876203
511 | "79882",720.107342908215,0.0267238490637471,0.131675952510957,0.202951629011556,0.839172839779231,0.882510404274872
512 | "319085",0.656231813537233,-0.0256666741536063,0.637018657909253,-0.040291871886212,0.967860433813788,NA
513 | "57161",1.34982768807623,-0.0231911700640487,0.735795424120321,-0.0315185027030779,0.974856036451835,NA
514 | "56339",603.374717737829,0.0193340263871433,0.131666762546076,0.146840599808759,0.883257837994181,0.913855684575966
515 | "161142",157.58882347319,-0.0191748144263907,0.188331338482142,-0.101814252375257,0.918904112681231,0.94636561489929
516 | "85495",13.7772996143032,0.0184393380656492,0.476394280914595,0.0387060441411027,0.969124754494401,0.983126079997857
517 | "283629",33.6562757967114,0.0113643678999274,0.320412787147166,0.0354678975240389,0.971706644326144,0.983126079997857
518 | "5826",338.203919945445,-0.0107342510807143,0.161561773522884,-0.0664405375519961,0.947027097184921,0.966442231295774
519 | "29933",0.496501735224831,0.0101667801744459,0.620986021075184,0.016371995229205,0.986937621324533,NA
520 | "122402",655.294326289855,0.0098108257143233,0.123596835258241,0.0793776450167577,0.936732249218829,0.960311321853628
521 | "5106",1825.27637627374,0.00980207444482611,0.137940211838238,0.0710603116683696,0.943349754984168,0.964887420623076
522 | "100750247",246.098352190864,0.00620993441760628,0.216051232319948,0.0287428789501652,0.977069658024916,0.985872087376492
523 | "79178",173.982566297219,-0.00598212821225816,0.171372031002736,-0.0349072609880117,0.972153690712166,0.983126079997857
524 | "1603",1084.93813710963,0.00447564311722734,0.120648184087738,0.0370966471735086,0.970407945358049,0.983126079997857
525 | "641455",32.8667883570347,-0.00289545755851296,0.360099448921897,-0.0080407164386996,0.993584505626131,0.9980400415258
526 | "8748",0.486968692394836,0.00154223521672577,0.618731795301086,0.0024925746962386,0.998011215192587,NA
527 | "7528",927.235426959943,-0.00152172960380106,0.122980409827027,-0.0123737561611755,0.990127422931953,0.996802439266326
528 | "122704",183.118502671239,0.000740079898197707,0.173435124654353,0.00426718578299897,0.996595288678331,0.998824808339804
529 | "123096",1091.25477087083,0.000115919740109735,0.152912741355643,0.000758077705507417,0.99939514156082,0.99939514156082
530 |
--------------------------------------------------------------------------------
/inst/extdata/csaw-data-filtered.Rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/inst/extdata/csaw-data-filtered.Rds
--------------------------------------------------------------------------------
/inst/extdata/csaw-normfacs.Rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/inst/extdata/csaw-normfacs.Rds
--------------------------------------------------------------------------------
/inst/script/ChIPSeq/NFYA/csaw.R:
--------------------------------------------------------------------------------
1 | ## source("preprocess_NFYA.R", echo=TRUE)
2 | source("setup.R")
3 |
4 | library(csaw)
5 | library(edgeR)
6 |
7 | ## ext: fragment length
8 | system.time({
9 | data <- windowCounts(files$bam, width=10, ext=110)
10 | }) # 156 seconds
11 | acc <- sub(".fastq.*", "", data$bam.files)
12 | colData(data) <- cbind(files[acc,], colData(data))
13 |
14 | ## filter low count windows
15 | keep <- aveLogCPM(assay(data)) >= -1
16 | data <- data[keep,]
17 |
18 | ## normalization factor
19 | system.time({
20 | binned <- windowCounts(files$bam, bin=TRUE, width=10000)
21 | }) # 139 seconds
22 | normfacs <- normalize(binned)
23 |
24 | ## Experimental design
25 | design <- model.matrix(~Treatment, colData(data))
26 |
27 | ## DB windows
28 | y <- asDGEList(data, norm.factors=normfacs)
29 | y <- estimateDisp(y, design)
30 | fit <- glmQLFit(y, design, robust=TRUE)
31 | results <- glmQLFTest(fit, contrast=c(0, 1))
32 |
33 | ## Correct for multiple testing
34 | merged <- mergeWindows(rowRanges(data), tol=1000L)
35 | tabcom <- combineTests(merged$id, results$table)
36 |
--------------------------------------------------------------------------------
/inst/script/ChIPSeq/NFYA/preprocess_NFYA.R:
--------------------------------------------------------------------------------
1 | ## SystemRequirements: ascp; fastq-dump
2 |
3 | source("setup.R")
4 | library(SRAdb)
5 | library(Rsubread)
6 | library(Rsamtools)
7 | library(BiocParallel)
8 |
9 | sradb <- "SRAmetadb.sqlite"
10 | key <- "/app/aspera-connect/3.1.1/etc/asperaweb_id_dsa.openssh"
11 | cmd = sprintf("ascp -TT -l300m -i %s", key)
12 | source("setup.R")
13 |
14 | if (!file.exists(sradb))
15 | getSRAdbFile()
16 | con = dbConnect(dbDriver("SQLite"), sradb)
17 |
18 | accs <- rownames(files)[!file.exists(files$sra)]
19 | for (acc in accs)
20 | sraFiles = ascpSRA(acc, con, cmd, fileType="sra", destDir=getwd())
21 |
22 | sras <- files$sra[!file.exists(files$fastq)]
23 | bplapply(sras, function(sra) system(sprintf("fastq-dump --gzip %s", sra)))
24 |
25 |
26 | fastqs <- files$fastq[!file.exists(files$bam)]
27 | if (length(fastqs))
28 | Rsubread::align("../mm10/mm10.Rsubread.index", fastqs,
29 | nthreads=parallel::detectCores() / 2L)
30 |
31 | bams <- files$bam[!file.exists(sprintf("%s.bai", files$bam))]
32 | bams_sorted <- sub(".BAM", ".sorted.bam", bams)
33 | sorted <- bpmapply(sortBam, bams, bams_sorted)
34 | ## oops! didn't mean to do the next line
35 | file.rename(sorted, names(sorted))
36 | bplapply(sorted, indexBam)
37 |
--------------------------------------------------------------------------------
/inst/script/ChIPSeq/NFYA/setup.R:
--------------------------------------------------------------------------------
1 | files <- local({
2 | acc <- c(es_1="SRR074398", es_2="SRR074399", tn_1="SRR074417",
3 | tn_2="SRR074418")
4 | data.frame(Treatment=sub("_.*", "", names(acc)),
5 | Replicate=sub(".*_", "", names(acc)),
6 | sra=sprintf("%s.sra", acc),
7 | fastq=sprintf("%s.fastq.gz", acc),
8 | bam=sprintf("%s.fastq.gz.subread.BAM", acc),
9 | row.names=acc, stringsAsFactors=FALSE)
10 | })
11 |
--------------------------------------------------------------------------------
/vignettes/A_Introduction.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "A. Introduction: Use R / Bioconductor for Sequence Analysis"
3 | author: Martin Morgan (mtmorgan@fredhutch.org)
4 | date: "`r Sys.Date()`"
5 | output:
6 | BiocStyle::html_document:
7 | toc: true
8 | vignette: >
9 | %\VignetteIndexEntry{A. Introduction}
10 | %\VignetteEngine{knitr::rmarkdown}
11 | \usepackage[utf8]{inputenc}
12 | ---
13 |
14 | ```{r style, echo=FALSE, results='asis'}
15 | BiocStyle::markdown()
16 | ```
17 |
18 | # Introduction
19 |
20 | This **INTERMEDIATE** course is designed for individuals comfortable
21 | using _R_, and with some familiarity with _Bioconductor_. It consists
22 | of approximately equal parts lecture and practical sessions addressing
23 | use of _Bioconductor_ software for analysis and comprehension of
24 | high-throughput sequence and related data. Specific topics include use
25 | of central Bioconductor classes (e.g., _GRanges_,
26 | _SummarizedExperiment_), RNASeq gene differential expression, ChIP-seq
27 | and methylation work flows, approaches to management and integrative
28 | analysis of diverse high-throughput data types, and strategies for
29 | working with large data. Participants are required to bring a laptop
30 | with wireless internet access and a modern version of the Chrome or
31 | Safari web browser.
32 |
33 | # Schedule (tentative)
34 |
35 | Day 1 (9:00 - 12:30; 1:30 - 5:00)
36 |
37 | - [B. Genomic Ranges](B_GenomicRanges.html). Working with Genomic
38 | Ranges and other _Bioconductor_ data structures (e.g., in the
39 | [GenomicRanges](http://bioconductor.org/packages/devel/bioc/html/GenomicRanges.html)
40 | package).
41 | - [C. Differential Gene Expression](C_DifferentialExpression.html). RNA-Seq
42 | known gene differential expression with
43 | [DESeq2](http://bioconductor.org/packages/devel/bioc/html/DESeq2.html)
44 | and
45 | [edgeR](http://bioconductor.org/packages/devel/bioc/html/edgeR.html).
46 |
47 | Day 2 (9:00 - 12:30; 1:30 - 5:00)
48 |
49 | - [D. Machine Learning](D_MachineLearning.html).
50 | - [E. Gene Set Enrichment](E_GeneSetEnrichment.html).
51 | - [F. ChIP-seq](F_ChIPSeq.html). ChIP-seq with
52 | [csaw](http://bioconductor.org/packages/devel/bioc/html/csaw.html).
53 | - [I. Large Data](I_LargeData.html) -- efficient, parallel, and cloud
54 | programming with
55 | [BiocParallel](http://bioconductor.org/packages/devel/bioc/html/BiocParallel.html),
56 | [GenomicFiles](http://bioconductor.org/packages/devel/bioc/html/GenomicFiles.html),
57 | and other resources.
58 |
59 | # Bioconductor
60 |
61 | Analysis and comprehension of high-throughput genomic data
62 |
63 | - Sequencing, microarrays, proteomics, flow cytometry, imaging, ...
64 | - Recent publication: Huber et al., Orchestrating high-throughput
65 | genomic analysis with Bioconductor. Nature Methods
66 | 12:[115-121](http://www.nature.com/nmeth/journal/v12/n2/abs/nmeth.3252.html)
67 |
68 | Packages
69 |
70 | - Software
71 | - Annotation -- static, versioned annotations, e.g., genome sequences,
72 | gene models, gene identifiers
73 | - Experiment data -- example experiments used to illustrate work
74 | flows and analysis
75 | - Recently: `r Biocpkg("AnnotationHub")` for easy web-based retrieval
76 | of genome-scale annotation
77 |
78 | # Sequencing work flows & ecosystem
79 |
80 | 0. Research question
81 | 1. Experimental Design
82 | 2. Wet lab
83 | 3. Sequencing
84 | 4. Alignment
85 | 5. _Reduction_
86 | 6. _Analysis_
87 | 7. _Comprehension_ (annotation, integration, visualization)
88 |
89 | 
90 |
--------------------------------------------------------------------------------
/vignettes/B_GenomicRanges.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: B. Genomic Ranges
3 | author: Martin Morgan (mtmorgan@fredhutch.org)
4 | date: "`r Sys.Date()`"
5 | output:
6 | BiocStyle::html_document:
7 | toc: true
8 | vignette: >
9 | %\VignetteIndexEntry{B. Genomic Ranges}
10 | %\VignetteEngine{knitr::rmarkdown}
11 | \usepackage[utf8]{inputenc}
12 | ---
13 |
14 | ```{r style, echo=FALSE, results='asis'}
15 | BiocStyle::markdown()
16 | suppressPackageStartupMessages({
17 | library(GenomicAlignments)
18 | library(BSgenome.Hsapiens.UCSC.hg19)
19 | library(TxDb.Hsapiens.UCSC.hg19.knownGene)
20 | })
21 | ```
22 |
23 | # Motivation
24 |
25 | Genomic ranges describe...
26 |
27 | - Annotations, e.g., exons, genes, binding sites, ...
28 | - Data, e.g., aligned reads, called peaks, copy number regions
29 | - See Lawrence et al., Software for computing and annotating genomic
30 | ranges, PLoS Comput Biol 9(8):
31 | [e1003118](https://doi.org/10.1371/journal.pcbi.1003118)
32 |
33 | Packages
34 |
35 | - `r Biocpkg("GenomicRanges")` for essential genomic ranges; depends
36 | on `r Biocpkg("IRanges")`, `r Biocpkg("S4Vectors")` and other
37 | packages
38 | - `r Biocpkg("GenomicAlignments")` for aligned reads using genomic
39 | range-based concepts. Depends on `r Biocpkg("Rsamtools")`, `r
40 | Biocpkg("Biostrings")` and other packages
41 |
42 | ```{r packages}
43 | library(GenomicRanges)
44 | library(GenomicAlignments)
45 | sessionInfo()
46 | ```
47 |
48 | ## Use cases
49 |
50 | `GRanges`: simple genomic ranges
51 |
52 | - e.g., exons, binding sites
53 | - e.g., called peaks, reads aligned without gaps
54 | - _vector_ of genomic ranges
55 | - _metadata_ `mcols()` of associated data, e.g., 'score', 'id', ...
56 |
57 | 
58 |
59 | - `seqnames()`, e.g., chromosome, but could be, e.g., contig, ...
60 | - `start()`, `end()`: 1-based, closed intervals
61 | - `strand()`: +, -, or * (does not matter)
62 | - `mcols()`
63 |
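As a minimal sketch of the accessors listed above, the following builds a small `GRanges` by hand. The coordinates and the `score` and `id` metadata columns are invented for illustration and do not come from any data set used in the course.

```r
library(GenomicRanges)

## Three hypothetical ranges on two sequences; 'score' and 'id' are
## illustrative metadata columns stored in mcols()
gr <- GRanges(
    seqnames = c("chr1", "chr1", "chr2"),
    ranges = IRanges(start = c(10, 20, 5), end = c(15, 30, 12)),
    strand = c("+", "-", "*"),
    score = c(1.5, 2.0, 0.3),
    id = c("a", "b", "c")
)

seqnames(gr)   # run-length encoded sequence names
start(gr)      # 1-based start coordinates
end(gr)        # end coordinates, included in the range (closed interval)
width(gr)      # end - start + 1
strand(gr)     # +, -, or *
mcols(gr)      # DataFrame holding 'score' and 'id'
```

Because intervals are closed, a range from 10 to 15 has width 6 rather than 5.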
64 | `GRangesList`: nested genomic ranges
65 |
66 | - e.g., exons-within-transcripts
67 | - e.g., aligned reads with gaps; paired-end reads
68 |
69 | 
70 |
71 | - Accessors return `*List` objects: lists, but all elements of the
72 | same type. E.g., `start()` returns an `IntegerList()`.
73 | - `unlist()`
74 |
75 | # Range-based operations
76 |
77 | 
78 |
79 | Intra-range operations
80 |
81 | - e.g., `shift()`, `flank()`
82 |
83 | Inter-range operations
84 |
85 | - e.g., `reduce()`, `disjoin()`
86 |
87 | Between-object
88 |
89 | - e.g., `psetdiff()`, `findOverlaps()`, `countOverlaps()`
90 |
91 | 
92 |
93 | PLoS Comput Biol 9(8):
94 | [e1003118](https://doi.org/10.1371/journal.pcbi.1003118)
95 |
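The three families of operations can be tried on a toy object. This is a sketch only; the ranges below are made up so that the results are easy to verify by eye.

```r
library(GenomicRanges)

## Hypothetical ranges: the first two overlap, the third stands alone
gr <- GRanges("chr1",
              IRanges(start = c(11, 15, 30), end = c(20, 22, 35)),
              strand = "+")

## Intra-range: each range is transformed independently of the others
shift(gr, 5)     # every range moved 5 bases to the right
flank(gr, 3)     # the 3 bases upstream of each range (strand-aware)

## Inter-range: ranges within a single object are considered together
reduce(gr)       # merge overlapping ranges: 11-22, 30-35
disjoin(gr)      # non-overlapping pieces: 11-14, 15-20, 21-22, 30-35

## Between-object: compare a query against a subject
query <- GRanges("chr1", IRanges(start = 18, end = 32), strand = "+")
findOverlaps(query, gr)    # Hits object pairing query and subject indices
countOverlaps(query, gr)   # number of subject ranges hit by each query
```

Intra-range results always have the same length as the input, whereas inter-range results generally do not.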
96 | # Working with _Bioconductor_ classes and methods
97 |
98 | What can I do with my `GRanges` instance?
99 |
100 | ```{r methods-class}
101 | methods(class="GRanges")
102 | ```
103 |
104 | What type of object(s) can I use `findOverlaps()` on (what _methods_
105 | exist for the `findOverlaps()` _generic_)?
106 |
107 | ```{r methods-generic}
108 | methods(findOverlaps)
109 | ```
110 |
111 | How can I get help on functions, generics, and methods?
112 |
113 | ```{r help, eval=FALSE}
114 | ?"findOverlaps" ## generic
115 | ?"findOverlaps,GenomicRanges,GenomicRanges-method" ## specific method
116 | ```
117 |
118 | Other help?
119 |
120 | - Vignettes, e.g., `browseVignettes("GenomicRanges")`
121 | - Work flows, videos, training material on the Bioconductor
122 | [web site](http://bioconductor.org/)
123 | - Questions and answers on the Bioconductor
124 | [support site](https://support.bioconductor.org)
125 |
126 | # Important parts of the sequence class menagerie
127 |
128 | `GAlignments` and friends (`r Biocpkg("GenomicAlignments")`)
129 |
130 | - `GAlignments`: Single-end aligned reads, e.g., from BAM files
131 | - `GAlignmentPairs`, `GAlignmentsList`: paired-end aligned
132 | reads. `*Pairs` is more restrictive on what pairs can be represented
133 |
134 | `DNAString` and `DNAStringSet` (`r Biocpkg("Biostrings")`)
135 |
136 | `SummarizedExperiment` (`r Biocpkg("GenomicRanges")`)
137 |
138 | 
139 |
140 | - `assays` of rows (regions of interest; genomic ranges) x columns
141 | (samples, including integrated phenotypic information)
142 |
143 | `TxDb` (`r Biocpkg("GenomicFeatures")`)
144 |
145 | - Coordinating transcript-level descriptions derived, e.g., from GTF,
146 | UCSC, Biomart
147 | - `transcripts()` interface
148 | - `select()` interface
149 |
150 | `VCF` (`r Biocpkg("VariantAnnotation")`)
151 |
152 | Lower-level classes
153 |
154 | - `DataFrame` (`r Biocpkg("S4Vectors")`) (like a `data.frame`, but can
155 | contain S4 objects)
156 | - `IRanges` (`r Biocpkg("IRanges")`), `Rle` (`r Biocpkg("S4Vectors")`)
157 |
158 | 
159 |
160 | # Deeper understanding
161 |
162 | ## Classes and class hierarchies
163 |
164 | - 'Ideally', the user can remain ignorant of how classes are
165 | implemented. Actually, it sometimes helps!
166 | - Slots
167 | - Contained classes
168 |
169 | _R_ works efficiently on vectors
170 |
171 | - Represent `GRanges` as a collection of vectors, not as a collection
172 | of records
173 |
174 | ```{r getClass-GRanges}
175 | getClass("GRanges")
176 | ```
177 |
178 | 
179 |
180 | ## `Vector` and `Annotated`
181 |
182 | - `[`, `length()`, `names()`, etc.
183 | - `mcols()`
184 |
185 | ## `List`-like
186 |
187 | - `[[`
188 | - `elementLengths()`
189 |
190 | Implementation: `Vector` plus partitioning
191 |
192 | - Consequence: `unlist()` and `relist()` are very inexpensive
193 |
194 | 
195 |
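One common idiom that relies on cheap `unlist()` / `relist()` is to flatten a list, do vectorized work on the flat object, and then restore the grouping. The sketch below uses invented transcripts `tx1` and `tx2`. It uses `lengths()` to count elements, on the assumption that a recent _Bioconductor_ release is installed (newer releases provide `elementNROWS()` in place of `elementLengths()`).

```r
library(GenomicRanges)

## Two hypothetical transcripts with 2 and 3 exons
grl <- GRangesList(
    tx1 = GRanges("chr1", IRanges(c(1, 20), c(10, 30))),
    tx2 = GRanges("chr1", IRanges(c(50, 70, 90), c(60, 80, 100)))
)

flat <- unlist(grl)                            # a single GRanges of 5 exons
flat_wide <- resize(flat, width(flat) + 2L)    # vectorized work on the flat object
grl2 <- relist(flat_wide, grl)                 # restore the original grouping

lengths(grl2)      # same partitioning as grl
sum(width(grl2))   # total exon width per transcript
```

Working on the flat object avoids an explicit loop over list elements.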
196 | # Practical
197 |
198 | ## 1. Exon and transcript characterization
199 |
200 | Ingredients
201 |
202 | - `r Biocannopkg("TxDb.Hsapiens.UCSC.hg19.knownGene")` TxDb package
203 | - `exons()`, and `exonsBy()` functions
204 | - `width()`, `elementLengths()` accessors
205 | - `hist()`
206 |
207 | Goals
208 |
209 | - Histogram of exon widths
210 | - Histogram of transcript widths
211 | - Transcripts with exactly one exon
212 | - Transcripts with maximum number of exons
213 |
214 | ## 2. GC content
215 |
216 | Ingredients
217 | - `r Biocannopkg("BSgenome.Hsapiens.UCSC.hg19")` BSgenome package
218 | - `r Biocannopkg("TxDb.Hsapiens.UCSC.hg19.knownGene")` TxDb package
219 | - `?"getSeq,BSgenome-method"`, `letterFrequency()`
220 |
221 | Goals
222 |
223 | - GC content of exons, transcripts
224 | - GC content of non-coding regions
225 |
226 | ## 3. CpG islands
227 |
228 | Ingredients
229 |
230 | - `DNAStringSet()` to construct CG island sequence
231 | - `matchPDict()` to find CG islands on BSgenome chromosomes
232 | - ?? `coverage()`, `tileGenome()`, `Views()`, following
233 | [Hints](https://support.bioconductor.org/p/65281/)
234 | - ?? `tileGenome()`, `findOverlaps()`, `splitAsList()`, `mean()`
235 |
236 | Goal
237 |
238 | - Average number of CpG islands in 'tiles' across chr 17
239 |
240 | ## 4. Aligned reads
241 |
242 | Ingredients
243 |
244 | - `r Biocexptpkg("RNAseqData.HNRNPC.bam.chr14")` and
245 | `RNAseqData.HNRNPC.bam.chr14_BAMFILES`
246 | - `readGAlignments()`, `readGAlignmentPairs()`, and
247 | `readGAlignmentsList()`
248 | - `BamFile()`
249 |
250 | Goals
251 |
252 | - Compare three input methods
253 | - Use `ScanBamParam()` `which` and `what` to selectively input data
254 | - `countOverlaps()` between reads and known genes
255 | - Use `BamFile()` `yieldSize` argument to iterate through file
256 |
257 |
--------------------------------------------------------------------------------
/vignettes/C_DifferentialExpression.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: C. Differential Expression
3 | author: Martin Morgan (mtmorgan@fredhutch.org)
4 | date: "`r Sys.Date()`"
5 | output:
6 | BiocStyle::html_document:
7 | toc: true
8 | vignette: >
9 | %\VignetteIndexEntry{C. Differential Expression}
10 | %\VignetteEngine{knitr::rmarkdown}
11 | \usepackage[utf8]{inputenc}
12 | ---
13 |
14 | ```{r style, echo=FALSE, results='asis'}
15 | BiocStyle::markdown()
16 | options(max.print=1000)
17 | suppressPackageStartupMessages({
18 | library(genefilter)
19 | library(airway)
20 | library(DESeq2)
21 | library(GenomicAlignments)
22 | library(GenomicFeatures)
23 | })
24 | ```
25 |
26 | # Work flow
27 |
28 | ## 1. Experimental design
29 |
30 | Keep it simple
31 |
32 | - Classical experimental designs
33 | - Time series
34 | - Without missing values, where possible
35 | - Intended analysis must be feasible -- can the available samples and
36 | hypothesis of interest be combined to formulate a testable
37 | statistical hypothesis?
38 |
39 | Replicate
40 |
41 | - Extent of replication determines nuance of biological question.
42 | - No replication (1 sample per treatment): qualitative description
43 | with limited statistical options.
44 | - 3-5 replicates per treatment: designed experimental manipulation
45 | with cell lines or other well-defined entities; 2-fold (?)
46 | change in average expression between groups.
47 | - 10-50 replicates per treatment: population studies, e.g., cancer
48 | cell lines.
49 | - 1000's of replicates: prospective studies, e.g., SNP discovery
50 | - One resource: `r Biocpkg("RNASeqPower")`
51 |
52 | Avoid confounding experimental factors with other factors
53 |
54 | - Common problems: samples from one treatment all on the same flow
55 | cell; samples from treatment 1 processed first, treatment 2
56 | processed second, etc.
57 |
58 | Record co-variates
59 |
60 | Be aware of _batch effects_
61 |
62 | - Known
63 |
64 | - Phenotypic covariates, e.g., age, gender
65 | - Experimental covariates, e.g., lab or date of processing
66 | - Incorporate into linear model, at least approximately
67 |
68 | - Unknown
69 |
70 | - Or just unexpected / undetected
71 | - Characterize using, e.g., `r Biocpkg("sva")`.
72 |
73 | - Surrogate variable analysis
74 |
75 | - Leek et al., 2010, Nature Reviews Genetics 11
76 | [733-739](http://www.nature.com/nrg/journal/v11/n10/abs/nrg2825.html),
77 | Leek & Storey PLoS Genet 3(9):
78 | [e161](https://doi.org/10.1371/journal.pgen.0030161).
79 | - Scientific finding: pervasive batch effects
80 | - Statistical insights: surrogate variable analysis: identify and
81 | build surrogate variables; remove known batch effects
82 | - Benefits: reduce dependence, stabilize error rate estimates, and
83 | improve reproducibility
84 | - _combat_ software / `r Biocpkg("sva")` _Bioconductor_ package
85 |
86 | 
87 | HapMap samples from one facility, ordered by date of processing.
88 |
89 | ## 2. Wet-lab
90 |
91 | Confounding factors
92 |
93 | - Record or avoid
94 |
95 | Artifacts of your _particular_ protocols
96 |
97 | - Sequence contaminants
98 | - Enrichment bias, e.g., non-uniform transcript representation.
99 | - PCR artifacts -- adapter contaminants, sequence-specific
100 | amplification bias, ...
101 |
102 | ## 3. Sequencing
103 |
104 | Axes of variation
105 |
106 | - Single- versus paired-end
107 | - Length: 50-200nt
108 | - Number of reads per sample
109 |
110 | Application-specific, e.g.,
111 |
112 | - ChIP-seq: short, single-end reads are usually sufficient
113 | - RNA-seq, known genes: single- or paired-end reads
114 | - RNA-seq, transcripts or novel variants: paired-end reads
115 | - Copy number: single- or paired-end reads
116 | - Structural variants: paired-end reads
117 | - Variants: depth via longer, paired-end reads
118 | - Microbiome: long paired-end reads (overlapping ends)
119 |
120 | ## 4. Alignment
121 |
122 | Alignment strategies
123 |
124 | - _de novo_
125 | - No reference genome; considerable sequencing and computational
126 | resources
127 | - Genome
128 | - Established reference genome
129 | - Splice-aware aligners
130 | - Novel transcript discovery
131 | - Transcriptome
132 | - Established reference genome; reliable gene model
133 | - Simple aligners
134 | - Known gene / transcript expression
135 |
136 | Splice-aware aligners (and _Bioconductor_ wrappers)
137 |
138 | - [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2) (`r Biocpkg("Rbowtie")`)
139 | - [STAR](https://github.com/alexdobin/STAR)
140 | ([doi](https://doi.org/10.1093/bioinformatics/bts635))
141 | - [subread](https://doi.org/10.1093/nar/gkt214) (`r Biocpkg("Rsubread")`)
142 | - Systematic evaluation (Engstrom et al., 2013,
143 | [doi](https://doi.org/10.1038/nmeth.2722))
144 |
145 | ## (5a. Bowtie2 / tophat / Cufflinks / Cuffdiff / etc)
146 |
147 | - [tophat](http://ccb.jhu.edu/software/tophat) uses Bowtie2 to perform
148 | basic single- and paired-end alignments, then uses algorithms to
149 | place difficult-to-align reads near to their well-aligned mates.
150 | - [Cufflinks](http://cole-trapnell-lab.github.io/cufflinks/)
151 | ([doi](https://doi.org/10.1038/nprot.2012.016)) takes _tophat_
152 | output and estimates existing and novel transcript abundance.
153 | [How Cufflinks Works](http://cufflinks.cbcb.umd.edu/howitworks.html)
154 | - [Cuffdiff](http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/)
155 | assesses statistical significance of estimated abundances between
156 | experimental groups
157 | - [RSEM](http://www.biomedcentral.com/1471-2105/12/323) includes de
158 | novo assembly and quantification
159 |
160 | ## 5. Reduction to 'count tables'
161 |
162 | - Use known gene model to count aligned reads overlapping regions of
163 | interest / gene models
164 | - Gene model can be public (e.g., UCSC, NCBI, ENSEMBL) or _ad hoc_ (gff file)
165 | - `GenomicAlignments::summarizeOverlaps()`
166 | - `Rsubread::featureCounts()`
167 | - [HTSeq](http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html),
168 | [htseq-count](http://www-huber.embl.de/users/anders/HTSeq/doc/count.html)
169 |
170 | ## 6. Analysis
171 |
172 | Unique statistical aspects
173 |
174 | - Large data, few samples
175 | - Comparison of each gene, across samples; _univariate_ measures
176 | - Each gene is analyzed by the _same_ experimental design, under the
177 | _same_ null hypothesis
178 |
179 | Summarization
180 |
181 | - Counts _per se_, rather than a summary (RPKM, FPKM, ...), are
182 | relevant for analysis
183 | - For a given gene, larger counts imply more information; RPKM etc.,
184 | treat all estimates as equally informative.
185 | - Comparison is across samples at _each_ region of interest; all
186 | samples have the same region of interest, so modulo library size
187 | there is no need to correct for, e.g., gene length or mappability.
188 |
189 | Normalization
190 |
191 | - Libraries differ in size (total counted reads per sample) for
192 | un-interesting reasons; we need to account for differences in
193 | library size in statistical analysis.
194 | - Total number of counted reads per sample is _not_ a good estimate of
195 | library size. It is un-necessarily influenced by regions with large
196 | counts, and can introduce bias and correlation across
197 | genes. Instead, use a robust measure of library size that takes
198 | account of skew in the distribution of counts (simplest: trimmed
199 | geometric mean; more advanced / appropriate encountered in the lab).
200 | - Library size (total number of counted reads) differs between
201 | samples, and should be included _as a statistical offset_ in
202 | analysis of differential expression, rather than 'dividing by' the
203 | library size early in an analysis.
204 |
205 | Appropriate error model
206 |
207 | - Count data is _not_ distributed normally or as a Poisson process,
208 | but rather as negative binomial.
209 | - Result of combining Poisson ('shot' noise, i.e., within-sample
210 | technical and sampling variation in read counts) with variation
211 | between biological samples.
212 | - A negative binomial model requires estimation of an additional
213 | parameter ('dispersion'), which is estimated poorly in small
214 | samples.
215 | - Basic strategy is to moderate per-gene estimates with more robust
216 | local estimates derived from genes with similar expression values (a
217 | little more on borrowing information is provided below).
218 |
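The mixture described above can be simulated in base _R_. This is a sketch under assumed values: the mean `mu` and the `dispersion` are hypothetical, and the biological variation is modelled as a gamma distribution on the per-sample expected count, which yields negative binomial counts.

```r
set.seed(1)
n <- 10000
mu <- 100            # hypothetical mean count
dispersion <- 0.2    # hypothetical negative binomial dispersion

## Technical (shot) noise only: Poisson counts around a fixed mean
pois <- rpois(n, lambda = mu)

## Biological variation: each sample has its own expected count, drawn
## from a gamma distribution with mean mu and variance dispersion * mu^2
lambda <- rgamma(n, shape = 1 / dispersion, scale = mu * dispersion)
negbin <- rpois(n, lambda = lambda)

## Theoretical variances: mu for Poisson, mu + dispersion * mu^2 for NB
c(poisson = var(pois), negbin = var(negbin))
```

The Poisson variance stays close to the mean, while the mixture is strongly over-dispersed; this extra variance is what the dispersion parameter captures.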
219 | Pre-filtering
220 |
221 | - Naively, a statistical test (e.g., t-test) could be applied to each
222 | row of a counts table. However, we have relatively few samples
223 | (10's) and very many comparisons (10,000's) so a naive approach is
224 | likely to be very underpowered, resulting in a very high _false
225 | discovery rate_.
226 | - A simple approach is to perform fewer tests by removing regions that
227 | could not possibly result in statistical significance, regardless of
228 | hypothesis under consideration.
229 | - Example: a region with 0 counts in all samples could not possibly be
230 | significant regardless of hypothesis, so exclude from further
231 | analysis.
232 | - Basic approaches: 'K over A'-style filter -- require a minimum of A
233 | (normalized) read counts in at least K samples. Variance filter,
234 | e.g., IQR (inter-quartile range) provides a robust estimate of
235 | variability; can be used to rank and discard least-varying regions.
236 | - More nuanced approaches: `r Biocpkg("edgeR")` vignette; work flow
237 | today.
238 |
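A 'K over A' filter takes one line of base _R_. The count matrix and the thresholds `K` and `A` below are invented for illustration, and raw rather than normalized counts are used to keep the sketch short. The `r Biocpkg("genefilter")` package offers `kOverA()` for the same purpose.

```r
## Hypothetical counts: 4 regions (rows) by 6 samples (columns)
counts <- rbind(
    geneA = c(0, 0, 0, 0, 0, 0),
    geneB = c(12, 0, 15, 1, 20, 0),
    geneC = c(3, 2, 4, 1, 0, 2),
    geneD = c(50, 61, 48, 70, 55, 66)
)

K <- 3    # minimum number of samples (illustrative)
A <- 10   # minimum count in each of those samples (illustrative)

## Keep rows with at least A counts in at least K samples
keep <- rowSums(counts >= A) >= K
counts[keep, , drop = FALSE]
```

The filter uses only the counts, not the sample labels, so it does not depend on the hypothesis under test.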
239 | Borrowing information
240 |
241 | - Why does low statistical power elevate false discovery rate?
242 | - One way of developing intuition is to recognize a t-test (for
243 | example) as a ratio of variances. The numerator is
244 | treatment-specific, but the denominator is a measure of overall
245 | variability.
246 | - Variances are measured with uncertainty; over- or under-estimating
247 | the denominator variance has an asymmetric effect on a t-statistic
248 | or similar ratio, with an underestimate _inflating_ the statistic
249 | more dramatically than an overestimate deflates the statistic. Hence
250 | elevated false discovery rate.
251 | - Under the typical null hypothesis used in microarray or RNA-seq
252 | experiments, each gene may respond differently to the treatment
253 | (numerator variance) but the overall variability of a gene is
254 | the same, at least for genes with similar average expression.
255 | - The strategy is to estimate the denominator variance as the
256 | within-group variance for the gene, _moderated_ by the average
257 | within-group variance across all genes.
258 | - This strategy exploits the fact that the same experimental design
259 | has been applied to all genes assayed, and is effective at
260 | moderating false discovery rate.
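
The effect of moderation can be seen in a small simulation. This is only a sketch under strong assumptions: the data are simulated with no true differences, and the 'moderation' is a crude equal-weight average of each gene's variance with the mean variance across genes, not the empirical Bayes procedure used by packages such as limma. All object names are hypothetical.

```r
set.seed(42)
nGenes <- 5000
nPerGroup <- 3
grp <- rep(1:2, each = nPerGroup)

## Null data: no gene is truly differentially expressed
x <- matrix(rnorm(nGenes * length(grp)), nrow = nGenes)

## Numerator: difference in group means for each gene
delta <- rowMeans(x[, grp == 1]) - rowMeans(x[, grp == 2])

## Denominator: pooled per-gene variance, estimated from only 3 + 3 samples
rowVariance <- function(m) rowSums((m - rowMeans(m))^2) / (ncol(m) - 1)
s2 <- (rowVariance(x[, grp == 1]) + rowVariance(x[, grp == 2])) / 2
stdErr <- function(v) sqrt(v * 2 / nPerGroup)

tOrdinary <- delta / stdErr(s2)

## Crude moderation: pull each gene's variance toward the average variance
s2Moderated <- (s2 + mean(s2)) / 2
tModerated <- delta / stdErr(s2Moderated)

c(ordinary = max(abs(tOrdinary)), moderated = max(abs(tModerated)))
```

Genes whose variance happens to be underestimated produce very large ordinary statistics; borrowing information across genes removes most of these extremes.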
261 |
262 | ## 7. Comprehension
263 |
264 | Placing differentially expressed regions in context
265 |
266 | - Gene names associated with genomic ranges
267 | - Gene set enrichment and similar analysis
268 | - Proximity to regulatory marks
269 | - Integrate with other analyses, e.g., methylation, copy number,
270 | variants, ...
271 |
272 | 
273 | Correlation between genomic copy number and mRNA expression
274 | identified 38 mis-labeled samples in the TCGA ovarian cancer
275 | Affymetrix microarray dataset.
276 |
277 | # Experimental and statistical issues in depth
278 |
279 | ## Normalization
280 |
281 | `r Biocpkg("DESeq2")` `estimateSizeFactors()`, Anders and Huber,
282 | [2010](http://genomebiology.com/2010/11/10/r106)
283 |
284 | - For each gene: geometric mean of counts across all samples.
285 | - For each sample: median, across genes, of the ratio of the sample's
286 | count to the gene's geometric mean.
287 | - Functions other than the median can be used; control genes can be
288 | used instead
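
The median-of-ratios idea can be written out directly in base R. This is a sketch of the calculation described above, not the DESeq2 implementation; the toy counts are hypothetical, constructed so that `sample2` is `sample1` at twice the depth and `sample3` at half the depth.

```r
## Hypothetical counts: 4 genes x 3 samples
counts <- cbind(sample1 = c(10, 20, 40, 80),
                sample2 = c(20, 40, 80, 160),
                sample3 = c(5, 10, 20, 40))

## Per-gene geometric mean across samples, computed on the log scale
logGeoMeans <- rowMeans(log(counts))

## Per-sample size factor: median over genes of count / geometric mean;
## genes with a zero count (infinite log) are skipped
sizeFactors <- apply(counts, 2, function(cnts) {
    ok <- is.finite(logGeoMeans) & cnts > 0
    exp(median(log(cnts[ok]) - logGeoMeans[ok]))
})
sizeFactors
```

For these counts the size factors are 1, 2 and 0.5, recovering the relative sequencing depths built into the example.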
289 |
290 | `r Biocpkg("edgeR")` `calcNormFactors()` TMM method of Robinson and
291 | Oshlack, [2010](http://genomebiology.com/2010/11/3/r25)
292 |
293 | - Identify reference sample: library with upper quartile closest to
294 | the mean upper quartile of all libraries
295 | - Calculate M-value of each gene (log-fold change relative to reference)
296 | - Summarize library size as weighted trimmed mean of M-values.
297 |
298 | ## Dispersion
299 |
300 | `r Biocpkg("DESeq2")` `estimateDispersions()`
301 |
302 | - Estimate per-gene dispersion
303 | - Fit a smoothed relationship between dispersion and abundance
304 |
305 | `r Biocpkg("edgeR")` `estimateDisp()`
306 |
307 | - Common: single dispersion for all genes; appropriate for small
308 | experiments (<10? samples)
309 | - Tagwise: a separate dispersion for each gene; appropriate for larger
310 | / well-behaved experiments
311 | - Trended: bin based on abundance, estimate common dispersion within
312 | bin, fit a loess-smoothed relationship between binned dispersion and
313 | abundance
314 |
315 | # Analysis of designed experiments in R
316 |
317 | ## Example: t-test
318 |
319 | `t.test()`
320 |
321 | - `x`: numeric vector of univariate measurements (first group)
322 | - `y`: optional numeric vector of measurements for the second group
323 | - `var.equal=TRUE`: appropriate for relatively small experiments where
324 | no additional information available?
325 | - `formula`: alternative representation, `measurement ~ factor`.
326 |
327 | ```{r sleep-t.test}
328 | head(sleep)
329 | plot(extra ~ group, data = sleep)
330 | ## Traditional interface
331 | with(sleep, t.test(extra[group == 1], extra[group == 2]))
332 | ## Formula interface
333 | t.test(extra ~ group, sleep)
334 | ## equal variance between groups
335 | t.test(extra ~ group, sleep, var.equal=TRUE)
336 | ```
337 |
338 | `lm()` and `anova()`
339 |
340 | - `lm()`: fit _linear model_.
341 | - `anova()`: statistical evaluation.
342 |
343 | ```{r sleep-lm}
344 | ## linear model; compare to t.test(var.equal=TRUE)
345 | fit <- lm(extra ~ group, sleep)
346 | anova(fit)
347 | ```
348 |
349 | - Under the hood: `formula`: translated into _model matrix_, used in
350 | `lm.fit()`.
351 | - With (implicit) intercept 1, last coefficient of model matrix
352 | reflects group effect
353 | - With intercept 0, the _contrast_ between the effects of coefficient 1 and
354 | coefficient 2 reflects the group effect
355 |
356 | ```{r sleep-model.matrix}
357 | ## underlying model, used in `lm.fit()`
358 | model.matrix(extra ~ group, sleep) # last column indicates group effect
359 | model.matrix(extra ~ 0 + group, sleep) # contrast between columns
360 | ```
361 |
362 | - Covariate -- fit base model containing only covariate, test
363 | improvement in fit when model includes factor of interest
364 |
365 | ```{r sleep-diff}
366 | fit0 <- lm(extra ~ ID, sleep)
367 | fit1 <- lm(extra ~ ID + group, sleep)
368 | anova(fit0, fit1)
369 | with(sleep, t.test(extra[group == 1], extra[group == 2], paired=TRUE))
370 | ```
371 |
372 | `genefilter::rowttests()`
373 |
374 | - t-tests for gene expression data
375 | - useful for exploratory analysis, but statistically sub-optimal
376 | - `x`: matrix of expression values
377 | - features x samples (reverse of how a 'statistician' would
378 | represent the data -- samples x features)
379 |
380 | - `fac`: factor of one or two levels describing experimental design
381 |
382 | Limitations
383 |
384 | - Assumes features are _independent_
385 | - Ignores common experimental design
386 | - Ignores multiple testing
387 |
388 | Consequences
389 |
390 | - Poor estimate of within-group variance for each feature
391 | - Elevated false discovery rate
392 |
393 | ## Common experimental designs
394 |
395 | - t-test: `count ~ factor`. Alternative: `count ~ 0 + factor` and
396 | contrasts
397 | - covariates: `count ~ covariate + factor`
398 | - Single factor, multiple levels (one-way ANOVA) -- statistical
399 | contrasts: specify model as `count ~ factor` or `count ~ 0 + factor`
400 | - Factorial designs -- main effects, `count ~ factor1 + factor2`; main
401 | effects and interactions, `count ~ factor1 * factor2`. Contrasts to
402 | ask specific questions
403 | - Paired designs: include ID as covariate (approximate, since ID is a
404 | random effect); `r Biocpkg("limma")` approach:
405 | `duplicateCorrelation()`
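
The formulas above map onto model matrices that can be inspected with `model.matrix()`. The sketch below uses a hypothetical 2 x 2 factorial layout (two cell lines, two treatments, two replicates each); the object names are illustrative only.

```r
## Hypothetical design: 2 cell lines x 2 treatments, 2 replicates each
design <- expand.grid(replicate = 1:2,
                      treatment = c("control", "treated"),
                      cell = c("A", "B"))

## Main effects only
main <- model.matrix(~ cell + treatment, design)
colnames(main)

## Main effects and interaction
inter <- model.matrix(~ cell * treatment, design)
colnames(inter)

## No intercept: one column per treatment level; the treatment effect is
## then a contrast between the two coefficients
means <- model.matrix(~ 0 + treatment, design)
contrast <- c(treatmentcontrol = -1, treatmenttreated = 1)
colnames(means)
```

The interaction model adds a single extra column, `cellB:treatmenttreated`, which asks whether the treatment effect differs between cell lines.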
406 |
407 | # Practical: RNA-Seq gene-level differential expression
408 |
409 | Adapted from Love, Anders, and Huber's Bioconductor
410 | [work flow](http://bioconductor.org/help/workflows/rnaseqGene/)
411 |
412 | Michael Love [1], Simon Anders [2], Wolfgang Huber [2]
413 |
414 | [1] Department of Biostatistics, Dana-Farber Cancer Institute and
415 | Harvard School of Public Health, Boston, US;
416 |
417 | [2] European Molecular Biology Laboratory (EMBL), Heidelberg, Germany.
418 |
419 | ## 1. Experimental design
420 |
421 | The data used in this workflow is an RNA-Seq experiment of airway
422 | smooth muscle cells treated with dexamethasone, a synthetic
423 | glucocorticoid steroid with anti-inflammatory effects. Glucocorticoids
424 | are used, for example, in asthma patients to prevent or reduce
425 | inflammation of the airways. In the experiment, four primary human
426 | airway smooth muscle cell lines were treated with 1 micromolar
427 | dexamethasone for 18 hours. For each of the four cell lines, we have a
428 | treated and an untreated sample. The reference for the experiment is:
429 |
430 | Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM,
431 | Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr,
432 | Tantisira KG, Weiss ST, Lu Q. "RNA-Seq Transcriptome Profiling
433 | Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates
434 | Cytokine Function in Airway Smooth Muscle Cells." PLoS One. 2014 Jun
435 | 13;9(6):e99625.
436 | PMID: [24926665](http://www.ncbi.nlm.nih.gov/pubmed/24926665).
437 | GEO: [GSE52778](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778).
438 |
439 | ## 2, 3, and 4: Wet lab, sequencing, and alignment
440 |
441 | - Paired-end sequencing leading to
442 | [FASTQ](http://en.wikipedia.org/wiki/FASTQ_format) files of reads
443 | and their quality scores.
444 |
445 | - Reads aligned to a reference genome or transcriptome, resulting in
446 | [BAM](http://samtools.github.io/hts-specs) files. Reads for this
447 | experiment were aligned to the Ensembl release 75 human reference
448 | genome using the [STAR](https://code.google.com/p/rna-star/) aligner
449 |
450 | ## 5. Reduction
451 |
452 | We use the `r Biocexptpkg("airway")` package to illustrate
453 | reduction. The package provides sample information, a subset of eight
454 | BAM files, and the known gene models required to count the reads.
455 |
456 | ```{r airway-bam-path}
457 | library(airway)
458 | path <- system.file(package="airway", "extdata")
459 | dir(path)
460 | ```
461 |
462 | ### Setup
463 |
464 | The ingredients for counting are:
465 |
466 | a. Metadata describing samples. Read this using `read.csv()`.
467 |
468 | ```{r airway-csv}
469 | csvfile <- dir(path, "sample_table.csv", full=TRUE)
470 | sampleTable <- read.csv(csvfile, row.names=1)
471 | head(sampleTable)
472 | ```
473 |
474 | b. BAM files containing aligned reads. Create an object that
475 | references these files. What does the `yieldSize` argument mean?
476 |
477 | ```{r airway-bam}
478 | library(Rsamtools)
479 | filenames <- dir(path, ".bam$", full=TRUE)
480 | bamfiles <- BamFileList(filenames, yieldSize=1000000)
481 | names(bamfiles) <- sub("_subset.bam", "", basename(filenames))
482 | ```
483 |
484 | c. Known gene models. These might come from an existing `TxDb`
485 | package, or created from biomart or UCSC, or from a
486 | [GTF file](http://www.ensembl.org/info/website/upload/gff.html). We'll
487 | take the hard road, making a TxDb object from the GTF file used to
488 | align reads and using the TxDb to get all exons, grouped by gene.
489 |
490 | ```{r airway-gtf-to-txdb}
491 | library(GenomicFeatures)
492 | gtffile <- file.path(path, "Homo_sapiens.GRCh37.75_subset.gtf")
493 | txdb <- makeTxDbFromGFF(gtffile, format="gtf", circ_seqs=character())
494 | genes <- exonsBy(txdb, by="gene")
495 | ```
496 |
497 | ### Counting
498 |
499 | After these preparations, the actual counting is easy. The function
500 | `summarizeOverlaps()` from the `r Biocpkg("GenomicAlignments")`
501 | package will do this. This produces a `SummarizedExperiment` object,
502 | which contains a variety of information about an experiment
503 |
504 | ```{r}
505 | library(GenomicAlignments)
506 | se <- summarizeOverlaps(features=genes, reads=bamfiles,
507 | mode="Union",
508 | singleEnd=FALSE,
509 | ignore.strand=TRUE,
510 | fragments=TRUE)
511 | colData(se) <- as(sampleTable, "DataFrame")
512 | se
513 | colData(se)
514 | rowData(se)
515 | head(assay(se))
516 | ```
517 |
518 | ## 6. Analysis using `r Biocpkg("DESeq2")`
519 |
520 | The previous section illustrates the reduction step on a subset of the
521 | data; here's the full data set
522 |
523 | ```{r airway-data}
524 | data(airway)
525 | se <- airway
526 | ```
527 |
528 | This object contains an informative `colData` slot -- prepared as
529 | described in the `r Biocexptpkg("airway")` vignette. In particular,
530 | the `colData()` include columns describing the cell line `cell` and
531 | treatment `dex` for each sample
532 |
533 | ```{r airway-cell-dex}
534 | colData(se)
535 | ```
536 |
537 | `r Biocpkg("DESeq2")` makes the analysis particularly easy: simply add
538 | the experimental design, run the pipeline, and extract the results
539 |
540 | ```{r airway-DESeq2-design}
541 | library(DESeq2)
542 | dds <- DESeqDataSet(se, design = ~ cell + dex)
543 | dds <- DESeq(dds)
544 | res <- results(dds)
545 | ```
546 |
547 | Simple visualizations / sanity checks include
548 |
549 | - Look at counts of strongly differentiated genes, to get a sense of
550 | how counts translate to the summary statistics reported in the
551 | result table
552 |
553 | ```{r plotcounts, fig.width=5, fig.height=5}
554 | topGene <- rownames(res)[which.min(res$padj)]
555 | res[topGene,]
556 | plotCounts(dds, gene=topGene, intgroup=c("dex"))
557 | ```
558 |
559 | - An 'MA' plot shows for each gene the between-group log-fold-change
560 | versus average log count; it should be funnel-shaped and
561 | approximately symmetric around `y=0`, with lots of between-treatment
562 | variation for genes with low counts.
563 |
564 | ```{r plotma}
565 | plotMA(res, ylim=c(-5,5))
566 | ```
567 |
568 | - Plot the distribution of (unadjusted) P values, which should be
569 | uniform (under the null) but with a peak at small P value (true
570 | positives, hopefully!)
571 |
572 | ```{r airway-DESeq2-hist}
573 | hist(res$pvalue, breaks=50)
574 | ```
575 |
576 | - Look at a 'volcano plot' of adjusted P-value versus log fold change,
577 | to get a sense of the fraction of up- versus down-regulated genes
578 |
579 | ```{r airway-DESeq2-volcano}
580 | plot(-log10(padj) ~ log2FoldChange, as.data.frame(res), pch=20)
581 | ```
582 |
583 | Many additional diagnostic approaches are described in the DESeq2 (and
584 | edgeR) vignettes, and in the RNA-seq gene differential expression work
585 | flow.
586 |
587 | ## 7. Comprehension
588 |
589 | See Part E, Gene Set Enrichment
590 |
591 |
--------------------------------------------------------------------------------
/vignettes/D_MachineLearning.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: D. Machine Learning
3 | author:
4 | Martin Morgan (mtmorgan@fredhutch.org)
5 | Sonali Arora (sarora@fredhutch.org)
6 | output:
7 | BiocStyle::html_document:
8 | toc: true
9 | vignette: >
10 | %\VignetteIndexEntry{D. Machine Learning}
11 | %\VignetteEngine{knitr::rmarkdown}
12 | \usepackage[utf8]{inputenc}
13 | ---
14 |
15 | ```{r style, echo=FALSE, results='asis'}
16 | BiocStyle::markdown()
17 | ```
18 |
19 | ## Introduction to Machine Learning
20 |
21 | Let's say that we are interested in predicting the run time of an athlete
22 | from shoe size, height and weight in a study of 100 people.
23 | We can do so using a simple linear regression model where
24 |
25 | ```{r eval=FALSE}
26 | y = beta0 + beta1 * height + beta2 * weight + beta3 * shoe_size
27 | ```
28 |
29 | Here y is the response variable (run time), n is the number of observations
30 | (100 people), p is the number of variables / features / predictors (3, i.e., height,
31 | weight, shoe size), and X is an n x p matrix.
32 |
33 | This data set is low dimensional, with n >> p, but most data sets coming
34 | out of modern biological techniques are high dimensional, i.e.,
35 | n << p. This poses a statistical challenge, and simple linear regression can no
36 | longer help us.
37 |
38 | For example,
39 |
40 | * Identify the risk factors (genes) for prostate cancer based on gene
41 | expression data
42 | * Predict the chances of breast cancer survival in a patient.
43 | * Identify patterns of gene expression among different sub types of
44 | breast cancer
45 |
46 | In all 3 examples listed above, n, the number of observations, is 30-40 patients
47 | whereas p, the number of features, is approximately 30,000 genes. Try writing a linear
48 | regression formula for the outcome variable, y, in any of the above three
49 | scenarios.
50 |
51 | Listed below are things that can go wrong with high dimensional data:
52 | - some of these predictors are useful, some are not
53 | - if we include too many predictors, we can over fit the data
54 |
55 | This is why we need Machine Learning. Let's first introduce some basic concepts
56 | and then dive into examples and a lab session.
57 |
58 | **Supervised Learning** - Use a data set X to predict the association with a
59 | response variable Y. The response variable can be continuous or categorical.
60 | For example: Predicting the chances of breast cancer survival in a patient.
61 |
62 | **Unsupervised Learning** - Discover the associations or patterns in X. No
63 | response variable is present. For example: Cluster similar genes into groups.
64 |
65 | **Training & Test Datasets** - Usually we split the observations into test and
66 | training data sets. We fit the model on the training data set and evaluate on the
67 | test data set. The test set error rate is an estimate of the model's performance
68 | on future data sets.
69 |
70 | **Model Selection** - We usually consider numerous models for a given problem.
71 | For example, if we are trying to identify the genes responsible for a given
72 | disease using a gene expression data set, we could have the following models:
73 | a) model 1 - use all 30,000 genes from the array to build a model
74 | b) model 2 - include only genes related to the pathway that we know is
75 | upregulated in that disease to build a model
76 | c) model 3 - include genes found in the literature which are known to influence
77 | this disease
78 | It is highly recommended to use the test set only on our final model to see
79 | how our model will do with new, unseen data. So how do we pick the best
80 | model which can be tested on the test data set?
81 |
82 | **Cross-validation**
83 | We can use different approaches to find the best model. Let's look at the
84 | commonly used approaches, namely, validation set, leave one out
85 | cross-validation, and k-fold cross-validation.
86 |
87 | Briefly, the __validation set approach__ divides the full data set into
88 | 3 groups - training set, validation set and test set. We train the models on
89 | the training set, evaluate their performance on the validation set, and then
90 | evaluate the best model on the test set.
91 |
92 | The __leave one out cross validation__ starts with fitting n models (where n is
93 | number of observations in the training data set), each on n-1 observations,
94 | evaluating each model on the left-out observation. The best model is the one
95 | for which the total test error is the smallest and that is then used to predict
96 | the test set.
97 |
98 | Lastly, __5 fold cross validation__ (k-fold with k=5) splits the training
99 | data set into 5 sets, repeatedly training the model on 4 of the sets and
100 | evaluating its performance on the remaining one.
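
A 5-fold cross-validation can be sketched with base R alone. The example below is illustrative only: the data are simulated (run time depends on height and weight but not on shoe size), the models are ordinary linear regressions, and `cvError()` is a hypothetical helper written for this sketch.

```r
set.seed(1)

## Simulated data: run time depends on height and weight only
n <- 100
dat <- data.frame(height = rnorm(n, 170, 10),
                  weight = rnorm(n, 70, 8),
                  shoe_size = rnorm(n, 42, 2))
dat$y <- 60 - 0.1 * dat$height + 0.2 * dat$weight + rnorm(n)

## Assign each observation to one of k folds at random
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

## Mean squared prediction error of a model formula, averaged over folds:
## fit on k - 1 folds, predict the held-out fold
cvError <- function(formula, data, fold) {
    mse <- vapply(sort(unique(fold)), function(i) {
        fit <- lm(formula, data = data[fold != i, ])
        held <- data[fold == i, ]
        mean((held$y - predict(fit, newdata = held))^2)
    }, numeric(1))
    mean(mse)
}

errs <- c(full = cvError(y ~ height + weight + shoe_size, dat, fold),
          reduced = cvError(y ~ height + weight, dat, fold),
          null = cvError(y ~ 1, dat, fold))
errs
```

The model with the smallest cross-validated error would be carried forward to a single, final evaluation on the test set.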
101 |
102 | **Bias, Variance, Overfitting** - Bias refers to the average difference between
103 | the actual betas and the estimated betas; variance refers to the amount by
104 | which the estimated betas differ across experiments. As the model complexity (number of
105 | variables) increases, the bias decreases and the variance increases. This is
106 | known as the Bias-Variance Tradeoff, and a model that has too much variance
107 | is said to be over fit.
108 |
109 | ## Datasets
110 |
111 | For **Unsupervised learning**, we will use RNA-Seq count data from the
112 | Bioconductor package, `r Biocpkg("airway")`. From the abstract, a brief
113 | description of the RNA-Seq experiment on airway smooth muscle (ASM) cell
114 | lines: “Using RNA-Seq, a high-throughput sequencing method, we characterized
115 | transcriptomic changes in four primary human ASM cell lines that were treated
116 | with dexamethasone - a potent synthetic glucocorticoid (1 micromolar for
117 | 18 hours).”
118 |
119 | ```{r message=FALSE}
120 | library(airway)
121 | data("airway")
122 | se <- airway
123 | colData(se)
124 | library("DESeq2")
125 | dds <- DESeqDataSet(se, design = ~ cell + dex)
126 | ```
127 |
128 | For **Supervised learning**, we will use cervical count data from the
129 | Bioconductor package, `r Biocpkg("MLSeq")`. This data set contains
130 | expression of 714 miRNAs in human samples. There are 29 tumor and 29
131 | non-tumor cervical samples. For learning purposes, we can treat these
132 | as two separate groups and run various classification algorithms.
133 |
134 | ```{r message=FALSE}
135 | library(MLSeq)
136 | filepath = system.file("extdata/cervical.txt", package = "MLSeq")
137 | cervical = read.table(filepath, header = TRUE)
138 | ```
139 |
140 |
141 | ## Unsupervised Learning
142 |
143 | Unsupervised Learning is a set of statistical tools intended for the setting
144 | in which we have only a set of 'p' features measured on 'n' observations.
145 | We are primarily interested in discovering interesting
146 | things about the 'p' features.
147 |
148 | Unsupervised Learning is often performed as a part of Exploratory Data Analysis.
149 | These tools help us to get a good idea about the data set. Unlike a supervised
150 | learning problem, where we can use prediction to gain some confidence about our
151 | learning algorithm, there is no way to check our model. The learning algorithm
152 | is thus, aptly named "unsupervised".
153 |
154 | **RLOG TRANSFORMATION**
155 |
156 | Many common statistical methods for exploratory analysis of multidimensional
157 | data, especially methods for clustering and ordination (e.g.,
158 | principal-component analysis and the like), work best for (at least
159 | approximately) homoskedastic data; this means that the variance of an observed
160 | quantity (here, the expression strength of a gene) does not depend on the mean.
161 |
162 | In RNA-Seq data, the variance grows with the mean. If one performs PCA
163 | (principal components analysis) directly on a matrix of normalized read counts,
164 | the result typically depends only on the few most strongly expressed genes
165 | because they show the largest absolute differences between samples.
166 |
167 | As a solution, DESeq2 offers the regularized-logarithm transformation, or rlog
168 | for short. See the help for ?rlog for more information and options.
169 |
170 | The function rlog returns a SummarizedExperiment object which contains the
171 | rlog-transformed values in its assay slot:
172 |
173 | ```{r}
174 | rld <- rlog(dds)
175 | head(assay(rld))
176 | ```
177 |
178 | To assess overall similarity between samples: Which samples are similar to each
179 | other, which are different? Does this fit to the expectation from the
180 | experiment's design? We use the R function dist to calculate the Euclidean
181 | distance between samples. To avoid that the distance measure is dominated by
182 | a few highly variable genes, and have a roughly equal contribution from all
183 | genes, we use it on the rlog-transformed data
184 |
185 | ```{r}
186 | sampleDists <- dist( t( assay(rld) ) )
187 | sampleDists
188 | ```
189 | Note the use of the function t to transpose the data matrix. We need this
190 | because dist calculates distances between data rows and our samples constitute
191 | the columns.
192 |
193 | **HEATMAP**
194 |
195 | We visualize the sample-to-sample distances in a heatmap, using the
196 | function heatmap.2 from the gplots package. Note that we have changed the row
197 | names of the distance matrix to contain treatment type and patient number
198 | instead of sample ID, so that we have all this information in view when
199 | looking at the heatmap.
200 |
201 | ```{r message=FALSE}
202 | library("gplots")
203 | library("RColorBrewer")
204 |
205 | sampleDistMatrix <- as.matrix( sampleDists )
206 | rownames(sampleDistMatrix) <- paste( rld$dex, rld$cell, sep="-" )
207 | colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
208 | hc <- hclust(sampleDists)
209 | heatmap.2( sampleDistMatrix, Rowv=as.dendrogram(hc),
210 | symm=TRUE, trace="none", col=colors,
211 | margins=c(2,10), labCol=FALSE )
212 |
213 | ```
214 |
215 | **PCA**
216 |
217 | Another way to visualize sample-to-sample distances is a principal-components
218 | analysis (PCA). In this ordination method, the data points (i.e., here, the
219 | samples) are projected onto the 2D plane such that they spread out in the two
220 | directions which explain most of the differences in the data. The x-axis is
221 | the direction (or principal component) which separates the data points the most.
222 | The amount of the total variance which is contained in the direction is
223 | printed in the axis label. Here, we have used the function plotPCA which comes
224 | with DESeq2. The two terms specified by intgroup are the interesting groups
225 | for labelling the samples; they tell the function to use them to choose colors.
226 |
227 | ```{r}
228 | plotPCA(rld, intgroup = c("dex", "cell"))
229 | ```
230 |
231 | From both visualizations, we see that the differences between cells are
232 | considerable, though not stronger than the differences due to treatment
233 | with dexamethasone. This shows why it will be important to account for this
234 | in differential testing by using a paired design (“paired”, because each dex
235 | treated sample is paired with one untreated sample from the same cell line).
236 | We are already set up for this by using the design formula ~ cell + dex when
237 | setting up the data object in the beginning.
238 |
239 | **MDS**
240 | Another plot, very similar to the PCA plot, can be made using the
241 | multidimensional scaling (MDS) function in base R. This is useful when we don't
242 | have the original data, but only a matrix of distances. Here we have the MDS
243 | plot for the distances calculated from the rlog transformed counts:
244 |
245 | ```{r}
246 | library(ggplot2)
247 | mds <- data.frame(cmdscale(sampleDistMatrix))
248 | mds <- cbind(mds, colData(rld))
249 | qplot(X1,X2,color=dex,shape=cell,data=as.data.frame(mds))
250 | ```
251 |
252 | ### Exercise:
253 | Use the plotMDS function from the limma package to make a similar plot.
254 | What is the advantage of using this function over base R's cmdscale?
255 |
256 | **Solutions:**
257 |
258 | A similar plot can be made using the plotMDS() function in limma where the input
259 | is a matrix of log-expression values. Here the advantage is that the
260 | distances on the plot are proportional to log2-fold change and not only is the plot
261 | created, but the object (with distance matrix) is also returned.
262 |
263 | ```{r plotMDS}
264 | suppressPackageStartupMessages({
265 | library(limma)
266 | library(DESeq2)
267 | library(airway)
268 | })
269 | plotMDS(assay(rld), col=as.integer(dds$dex), pch=as.integer(dds$cell))
270 | ```
271 |
272 |
273 |
274 | ## Supervised Learning
275 |
276 | In supervised learning, along with the 'p' features, we
277 | also have a response Y measured on the same n observations. The goal is then
278 | to predict Y using X (n x p matrix) for new observations.
279 |
280 | For the cervical data, we know that the first 29 are non-Tumor samples
281 | whereas the last 29 are Tumor samples. We will code these as 0 and 1
282 | respectively. We will randomly sample 20% of our data and use that as a
283 | test set. The remaining 80% of the data will be used as training data.
284 |
285 | ```{r }
286 | set.seed(9)
287 |
288 | class = data.frame(condition = factor(rep(c(0, 1), c(29, 29))))
289 |
290 | nTest = ceiling(ncol(cervical) * 0.2)
291 | ind = sample(ncol(cervical), nTest, FALSE)
292 |
293 | cervical.train = cervical[, -ind]
294 | cervical.train = as.matrix(cervical.train + 1)
295 | classtr = data.frame(condition = class[-ind, ])
296 |
297 | cervical.test = cervical[, ind]
298 | cervical.test = as.matrix(cervical.test + 1)
299 | classts = data.frame(condition = class[ind, ])
300 | ```
301 |
302 | MLSeq aims to make computation less complicated for a user and
303 | allows one to learn a model using various classifiers with a single function.
304 |
305 | The main function of this package is classify which requires data in the form of
306 | a DESeqDataSet instance. The DESeqDataSet is a subclass of SummarizedExperiment,
307 | used to store the input values, intermediate calculations and results of an
308 | analysis of differential expression.
309 |
310 | So let's create a DESeqDataSet object for both the training and test set, and run
311 | DESeq on each.
312 |
313 | ```{r}
314 | cervical.trainS4 = DESeqDataSetFromMatrix(countData = cervical.train,
315 | colData = classtr, formula(~condition))
316 | cervical.trainS4 = DESeq(cervical.trainS4, fitType = "local")
317 |
318 | cervical.testS4 = DESeqDataSetFromMatrix(countData = cervical.test, colData = classts,
319 | formula(~condition))
320 | cervical.testS4 = DESeq(cervical.testS4, fitType = "local")
321 |
322 | ```
323 | Classify using Support Vector Machines.
324 |
325 | ```{r}
326 | svm = classify(data = cervical.trainS4, method = "svm", normalize = "deseq",
327 | deseqTransform = "vst", cv = 5, rpt = 3, ref = "1")
328 | svm
329 | ```
330 |
331 | It returns an object of class 'MLSeq' and we observe that it successfully
332 | fitted a model with 97.8% accuracy. We can list the slots of this S4 class with
333 | ```{r}
334 | getSlots("MLSeq")
335 | ```
336 | And also, ask about the model trained.
337 |
338 | ```{r}
339 | trained(svm)
340 | ```
341 |
342 | We can predict the class labels of our test data using `predictClassify()`
343 |
344 | ```{r}
345 | pred.svm = predictClassify(svm, cervical.testS4)
346 | table(pred.svm, relevel(cervical.testS4$condition, 2))
347 | ```
348 |
349 | The other classification methods available are 'randomforest', 'cart' and
350 | 'bagsvm'.
351 |
352 | ### Exercise:
353 |
354 | Train a random forest classifier on the same training data and evaluate it on the test data.
355 |
356 | **Solutions:**
357 |
358 | ```{r}
359 | rf = classify(data = cervical.trainS4, method = "randomforest",
360 | normalize = "deseq", deseqTransform = "vst", cv = 5, rpt = 3, ref = "1")
361 | trained(rf)
362 | pred.rf = predictClassify(rf, cervical.testS4)
363 | table(pred.rf, relevel(cervical.testS4$condition, 2))
364 | ```
365 |
366 | ## SessionInfo
367 |
368 | ```{r}
369 | sessionInfo()
370 | ```
371 |
372 | ## References
373 |
374 | 1. Zararsiz G, Goksuluk D, Korkmaz S, Eldem V, Duru IP, Unver T and Ozturk A (2014). MLSeq: Machine learning interface for RNA-Seq data. R package version 1.3.0.
375 | 2. Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q (2014). “RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS ONE, 9(6), e99625. http://www.ncbi.nlm.nih.gov/pubmed/24926665.
376 | 3. An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
377 | 4. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, Jerome Friedman
378 |
379 |
--------------------------------------------------------------------------------
/vignettes/E_GeneSetEnrichment.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: E. Gene Set Enrichment
3 | author: Martin Morgan (mtmorgan@fredhutch.org)
4 | date: "`r Sys.Date()`"
5 | output:
6 | BiocStyle::html_document:
7 | toc: true
8 | vignette: >
9 | %\VignetteIndexEntry{E. Gene Set Enrichment}
10 | %\VignetteEngine{knitr::rmarkdown}
11 | \usepackage[utf8]{inputenc}
12 | ---
13 |
14 | ```{r style, echo=FALSE, results='asis'}
15 | BiocStyle::markdown()
16 | suppressPackageStartupMessages({
17 | library(edgeR)
18 | library(goseq)
19 | library(org.Hs.eg.db)
20 | library(GO.db)
21 | })
22 | ```
23 |
24 | # Motivation
25 |
26 | Is expression of genes in a gene set associated with experimental
27 | condition?
28 |
29 | - E.g., Are there unusually many up-regulated genes in the gene set?
30 |
31 | Many methods; a recent review is Khatri et al., 2012.
32 |
33 | - Over-representation analysis (ORA) -- are differentially expressed
34 | (DE) genes in the set more common than expected?
35 | - Functional class scoring (FCS) -- summarize statistic of DE of genes
36 | in a set, and compare to null
37 | - Pathway topology (PT) -- include pathway knowledge in assessing DE
38 | of genes in a set
39 |
40 | ## What is a gene set?
41 |
42 | **Any** _a priori_ classification of 'genes' into biologically
43 | relevant groups
44 |
45 | - Members of same biochemical pathway
46 | - Proteins expressed in identical cellular compartments
47 | - Co-expressed under certain conditions
48 | - Targets of the same regulatory elements
49 | - On the same cytogenetic band
50 | - ...
51 |
52 | Sets do not need to be...
53 |
54 | - Exhaustive
55 | - Disjoint
56 |
57 | ## Collections of gene sets
58 |
59 | Gene Ontology ([GO](http://geneontology.org)) Annotation (GOA)
60 |
61 | - CC Cellular Components
62 | - BP Biological Processes
63 | - MF Molecular Function
64 |
65 | Pathways
66 |
67 | - [MSigDb](http://www.broadinstitute.org/gsea/msigdb/)
68 | - [KEGG](http://genome.jp/kegg) (no longer freely available)
69 | - [reactome](http://reactome.org)
70 | - [PantherDB](http://pantherdb.org)
71 | - ...
72 |
73 | E.g., [MSigDb](http://www.broadinstitute.org/gsea/msigdb/)
74 |
75 | - c1 Positional gene sets -- chromosome & cytogenetic band
76 | - c2 Curated Gene Sets from online pathway databases,
77 | publications in PubMed, and knowledge of domain experts.
78 | - c3 motif gene sets based on conserved cis-regulatory motifs
79 | from a comparative analysis of the human, mouse, rat, and dog
80 | genomes.
81 | - c4 computational gene sets defined by mining large collections
82 | of cancer-oriented microarray data.
83 | - c5 GO gene sets consist of genes annotated by the same GO
84 | terms.
85 | - c6 oncogenic signatures defined directly from microarray gene
86 | expression data from cancer gene perturbations.
87 | - c7 immunologic signatures defined directly from microarray
88 | gene expression data from immunologic studies.
89 |
90 | # Statistical approaches
91 |
92 | Initially based on a presentation by Simon Anders,
93 | [CSAMA 2010](http://marray.economia.unimi.it/2009/material/lectures/L8_Gene_Set_Testing.pdf)
94 |
95 | ## Approach 1: hypergeometric tests
96 |
97 | Steps
98 |
99 | 1. Classify each gene as 'differentially expressed' DE or not, e.g.,
100 | based on _P_ < 0.05
101 | 2. Are DE genes in the set more common than DE genes not in the set?
102 |
103 |
104 | |                               | In gene set | Not in gene set |
105 | |-------------------------------|-------------|-----------------|
106 | | Differentially expressed: yes | k           | K               |
107 | | Differentially expressed: no  | n - k       | N - K           |
108 |
133 | 3. Fisher hypergeometric test, via `fisher.test()` or `r Biocpkg("GOstats")`
134 |
135 | Notes
136 |
137 | - Conditional hypergeometric to accommodate GO DAG, `r Biocpkg("GOstats")`
138 | - But: artificial division into two groups
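
The table and test in step 3 can be sketched in base _R_. The counts below are invented for illustration only, and the sketch assumes `n` and `N` are the numbers of genes in and not in the set, following the table's notation. The one-sided Fisher test and the hypergeometric tail probability give the same _P_ value.

```r
## Hypothetical counts, for illustration only
k <- 30     # DE genes in the set
n <- 100    # all genes in the set
K <- 470    # DE genes not in the set
N <- 9900   # all genes not in the set

## 2 x 2 table: rows are DE yes / no, columns are in set yes / no
tbl <- matrix(c(k, n - k, K, N - K), nrow = 2,
              dimnames = list(DE = c("Yes", "No"),
                              InSet = c("Yes", "No")))

## One-sided test: are DE genes over-represented in the set?
ft <- fisher.test(tbl, alternative = "greater")

## Same P value from the hypergeometric distribution: probability of
## at least k DE genes when drawing n genes from k + K DE genes and
## (n - k) + (N - K) non-DE genes
p_hyper <- phyper(k - 1, k + K, (n - k) + (N - K), n, lower.tail = FALSE)

c(fisher = ft$p.value, hypergeometric = p_hyper)
```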
139 |
140 | ## Approach 2: enrichment score
141 |
142 | - Mootha et al., 2003; modified Subramanian et al., 2005.
143 |
144 | Steps
145 |
146 | - Sort genes by log fold change
147 | - Calculate running sum: incremented when gene in set, decremented when not.
148 | - Maximum of the running sum is enrichment score ES; large ES means
149 | that genes in set are toward top of list.
150 | - Permute subject labels to assess significance
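
The running-sum steps above can be sketched in base _R_ on simulated data. This is a minimal, unweighted sketch and not the GSEA software: the gene names, the gene set, and the step sizes (chosen so that the running sum returns to zero at the end of the list) are assumptions made for illustration, and the subject-label permutation step is omitted.

```r
set.seed(1)

## Simulated log fold changes for 1000 hypothetical genes
logFC <- setNames(rnorm(1000), paste0("g", 1:1000))

## Hypothetical gene set of 50 genes, shifted toward up-regulation
in_set_ids <- sample(names(logFC), 50)
logFC[in_set_ids] <- logFC[in_set_ids] + 1

## Sort genes by log fold change, largest first
ord <- names(sort(logFC, decreasing = TRUE))
hit <- ord %in% in_set_ids
n_hit <- sum(hit)
n_miss <- sum(!hit)

## Increment when the gene is in the set, decrement when it is not;
## these step sizes make the total over all genes equal zero
step <- ifelse(hit, sqrt(n_miss / n_hit), -sqrt(n_hit / n_miss))
running <- cumsum(step)

## Enrichment score: maximum of the running sum
ES <- max(running)
plot(running, type = "l", xlab = "Rank", ylab = "Running sum")
```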
151 |
152 |
153 |
154 | ## Approach 3: category $t$-test
155 |
156 | E.g., Jiang & Gentleman, 2007; `r Biocpkg("Category")`
157 |
158 | - Summarize $t$ (or other) statistic in each set
159 | - Test for significance by permuting the subject labels
160 | - Much more straight-forward to implement
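
A minimal base _R_ sketch of the two steps above, on simulated data. The expression matrix, the subject labels, and the gene set are all hypothetical, and summarizing by `sum(t) / sqrt(n)` is one common choice of summary assumed here; the `r Biocpkg("Category")` implementation may differ in detail.

```r
set.seed(123)

## Simulated data: 200 genes x 10 subjects, two groups of 5
expr <- matrix(rnorm(200 * 10), nrow = 200)
label <- factor(rep(c("A", "B"), each = 5))
in_set <- seq_len(20)    # hypothetical gene set: the first 20 genes

## Per-gene t statistics for the genes in the set
row_t <- function(x, g)
    apply(x, 1, function(v) t.test(v ~ g)$statistic)

## Summarize the t statistics across the set
set_stat <- function(x, g) {
    t <- row_t(x[in_set, , drop = FALSE], g)
    sum(t) / sqrt(length(t))
}

observed <- set_stat(expr, label)

## Null distribution: permute subject labels, not genes
null <- replicate(200, set_stat(expr, sample(label)))
p <- (sum(abs(null) >= abs(observed)) + 1) / (length(null) + 1)
p
```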
161 |
162 | Expression in NEG vs BCR/ABL samples for genes in the 'ribosome' KEGG
163 | pathway; `r Biocpkg("Category")` vignette.
164 |
165 |
166 |
167 |
168 | ## Competitive versus self-contained null hypothesis
169 |
170 | Goeman & Bühlmann, 2007
171 |
172 | - Competitive null: The genes in the gene set do not have stronger
173 | association with the subject condition than other genes. (Approach
174 | 1, 2)
175 | - Self-contained null: The genes in the gene set do not have any
176 | association with the subject condition. (Approach 3)
177 | - Probably, self-contained null is closer to actual question of interest
178 | - Permuting subjects (rather than genes) is appropriate
179 |
180 | ## Approach 4: linear models
181 |
182 | E.g., Hummel et al., 2008, `r Biocpkg("GlobalAncova")`
183 |
184 | - Colorectal tumors have good ('stage II') or bad ('stage III')
185 | prognosis. Do genes in the p53 pathway (_just one gene set!_) show
186 | different activity at the two stages?
187 | - Linear model incorporates covariates -- sex of patient, location of tumor
188 |
189 | `r Biocpkg("limma")`
190 |
191 | - Majewski et al., 2010 `romer()` and Wu & Smyth, 2012 `camera()` for
192 | enrichment (competitive null) linear models
193 | - Wu et al., 2010: `roast()`, `mroast()` for self-contained null
194 | linear models
195 |
196 | ## Approach 5: pathway topology
197 |
198 | E.g., Tarca et al., 2009, `r Biocpkg("SPIA")`
199 |
200 | - Incorporate pathway topology (e.g., interactions between gene products) into significance testing
201 |
202 | - Signaling Pathway Impact Analysis
203 |
204 | - Combined evidence: pathway over-representation $P_{NDE}$; unusual
205 | signaling $P_{PERT}$ (equation 1 of Tarca et al.)
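
One way to combine two pieces of evidence such as $P_{NDE}$ and $P_{PERT}$ is through their product. If both _P_ values are independent and uniform under the null, the probability of a product as small as $c$ is $c - c \ln c$. The sketch below checks that standard result by simulation. That this is the exact combination used by SPIA is an assumption here; consult Tarca et al. before relying on it. The function name `combine_two()` is hypothetical.

```r
## Probability that the product of two independent uniform P values
## is as small as p1 * p2
combine_two <- function(p1, p2) {
    prod_p <- p1 * p2
    prod_p - prod_p * log(prod_p)
}

## Check against simulation under the null
set.seed(2)
prod_null <- runif(1e5) * runif(1e5)
empirical <- mean(prod_null <= 0.05 * 0.2)
analytic <- combine_two(0.05, 0.2)
c(empirical = empirical, analytic = analytic)
```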
206 |
207 | Evidence plot, colorectal cancer. Points: pathway gene sets.
208 | Significant after Bonferroni (red) or FDR (blue) correction.
209 |
210 |
211 |
212 | ## Issues with sequence data?
213 |
214 | - All else being equal, long genes receive more reads than short genes
215 | - Power to detect DE, and hence per-gene $P$ values, depends on gene length
216 |
217 | E.g., Young et al., 2010, `r Biocpkg("goseq")`
218 |
219 | - Hypergeometric, weighted by gene size
220 | - Substantial differences
221 | - Better: read depth??
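
The length effect can be illustrated with a small simulation in base _R_. Every number here is an assumption (lengths, read rate, a 1.5-fold change applied to every gene), and the test statistic is deliberately crude; the point is only that, with the same fold change, longer genes are called DE more often.

```r
set.seed(42)
n_genes <- 2000

## Simulated transcript lengths, roughly 300 to 10000 bp
len <- round(10^runif(n_genes, 2.5, 4))

## Expected read count grows with length
mu <- len * 0.02
ctl <- rpois(n_genes, mu)
trt <- rpois(n_genes, mu * 1.5)    # identical fold change for all genes

## Crude per-gene test of equal counts
z <- (trt - ctl) / sqrt(trt + ctl + 1)
p <- 2 * pnorm(-abs(z))
de <- p < 0.05

## Proportion called DE in each length quartile
bins <- cut(len, quantile(len, 0:4 / 4), include.lowest = TRUE)
prop_de <- tapply(de, bins, mean)
prop_de
```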
222 |
223 | DE genes vs. transcript length. Points: bins of 300 genes. Line:
224 | fitted probability weighting function.
225 |
226 |
227 |
228 | ## Approach 6: _de novo_ discovery
229 |
230 | - So far: analogous to supervised machine learning, where pathways are
231 | known in advance
232 | - What about unsupervised discovery?
233 |
234 | Example: Langfelder & Horvath,
235 | [WGCNA](http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/)
236 |
237 | - Weighted correlation network analysis
238 | - Described in Langfelder & Horvath,
239 | [2008](http://www.biomedcentral.com/1471-2105/9/559)
240 |
241 | ## Representing gene sets in R
242 |
243 | - Named `list()`, where names of the list are sets, and each element
244 | of the list is a vector of genes in the set.
245 | - `data.frame()` of set name / gene name pairs
246 | - `r Biocpkg("GSEABase")`
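
The first two representations are easy to convert between in base _R_; the set and gene names below are made up for illustration.

```r
## Named list: names are sets, elements are genes in the set
sets <- list(setA = c("g1", "g2", "g3"), setB = c("g2", "g4"))

## data.frame of set name / gene name pairs
df <- stack(sets)
names(df) <- c("gene", "set")
df

## ... and back to a named list
sets2 <- split(df$gene, df$set)

## Sets need not be disjoint: g2 belongs to both
table(df$gene)
```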
247 |
248 | ## Conclusions
249 |
250 | Gene set enrichment classifications
251 |
252 | - Khatri et al.: Over-representation analysis; functional class
253 | scoring; pathway topology
254 | - Goeman & Bühlmann: Competitive vs. self-contained null
255 |
256 | Selected _Bioconductor_ packages
257 |
258 | | Approach | Packages |
259 | |-------------------|---------------------------------------------|
260 | | Hypergeometric | `r Biocpkg("GOstats")`, `r Biocpkg("topGO")`|
261 | | Enrichment | `r Biocpkg("limma")``::romer()` |
262 | | Category $t$-test | `r Biocpkg("Category")` |
263 | | Linear model | `r Biocpkg("GlobalAncova")`, `r Biocpkg("GSEAlm")`, `r Biocpkg("limma")``::roast()` |
264 | | Pathway topology | `r Biocpkg("SPIA")` |
265 | | Sequence-specific | `r Biocpkg("goseq")` |
266 | | _de novo_ | `r CRANpkg("WGCNA")` |
267 |
268 | # Practical
269 |
270 | This practical is based on section 6 of the `r Biocpkg("goseq")`
271 | [vignette](http://bioconductor.org/packages/devel/bioc/vignettes/goseq/inst/doc/goseq.pdf).
272 |
273 | ## 1-6 Experimental design, ..., Analysis of gene differential expression
274 |
275 | This (relatively old) experiment examined the effects of androgen
276 | stimulation on a human prostate cancer cell line, LNCaP (Li et al.,
277 | [2008](https://doi.org/10.1073/pnas.0807121105)). The experiment
278 | used short (35bp) single-end reads from 4 control and 3 treated
279 | lines. Reads were aligned to hg19 using Bowtie, and counted using
280 | ENSEMBL 54 gene models.
281 |
282 | Input the data to `r Biocpkg("edgeR")`'s `DGEList` data structure.
283 |
284 | ```{r prostate-edgeR-input}
285 | library(edgeR)
286 | path <- system.file(package="goseq", "extdata", "Li_sum.txt")
287 |
288 | table.summary <- read.table(path, sep='\t', header=TRUE, stringsAsFactors=FALSE)
289 | counts <- table.summary[,-1]
290 | rownames(counts) <- table.summary[,1]
291 | grp <- factor(rep(c("Control","Treated"), times=c(4,3)))
292 | summarized <- DGEList(counts, lib.size=colSums(counts), group=grp)
293 | ```
294 |
295 | Use a 'common' dispersion estimate, and compare the two groups using
296 | an exact test
297 |
298 | ```{r prostate-edgeR-de}
299 | disp <- estimateCommonDisp(summarized)
300 | tested <- exactTest(disp)
301 | topTags(tested)
302 | ```
303 |
304 | ## 7. Comprehension
305 |
306 | Start by extracting all P values, then correcting for multiple
307 | comparison using `p.adjust()`. Classify the genes as differentially
308 | expressed or not.
309 |
310 | ```{r prostate-edgeR-padj}
311 | padj <- with(tested$table, {
312 | keep <- logFC != 0
313 | value <- p.adjust(PValue[keep], method="BH")
314 | setNames(value, rownames(tested)[keep])
315 | })
316 | genes <- padj < 0.05
317 | table(genes)
318 | ```
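
For intuition about what `p.adjust(method="BH")` computes, here is a base _R_ sketch of the Benjamini-Hochberg adjustment on a handful of made-up _P_ values, checked against `p.adjust()` itself.

```r
## Hypothetical P values
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)
m <- length(p)

## Work from the largest P value down: scale each by m / rank, then
## take the running minimum so adjusted values stay monotone
o <- order(p, decreasing = TRUE)
bh <- pmin(1, cummin(p[o] * m / (m:1)))[order(o)]

cbind(p, bh, p.adjust = p.adjust(p, method = "BH"))
```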
319 |
320 | ### Gene symbol to pathway
321 |
322 | Under the hood, `r Biocpkg("goseq")` uses Bioconductor annotation
323 | packages (in this case `r Biocannopkg("org.Hs.eg.db")` and
324 | `r Biocannopkg("GO.db")`) to map from gene symbols to GO pathways.
325 |
326 | Explore these packages through the `columns()` and `select()`
327 | functions. Can you map from ENSEMBL gene identifiers (the row names
328 | of `topTags()`) to GO pathways? What about 'drilling down' on
329 | particular GO identifiers to discover the term's definition?
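
One possible starting point for this exploration, assuming the two annotation packages are installed. The key `ENSG00000141510` (the ENSEMBL identifier of TP53) is chosen only as an example; in the practical you would use row names from the edgeR results instead.

```r
library(org.Hs.eg.db)
library(GO.db)

## What kinds of identifiers does the package know about?
columns(org.Hs.eg.db)

## Map an ENSEMBL gene identifier to gene symbol and GO identifiers
ens2go <- select(org.Hs.eg.db, keys = "ENSG00000141510",
                 keytype = "ENSEMBL", columns = c("SYMBOL", "GO"))
head(ens2go)

## 'Drill down' on the GO identifiers for term and definition
go_terms <- select(GO.db, keys = unique(ens2go$GO),
                   keytype = "GOID", columns = c("TERM", "DEFINITION"))
head(go_terms)
```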
330 |
331 | ### Probability weighting function
332 |
333 | Calculate the weighting for each gene. This looks up the gene lengths
334 | in a pre-defined table (how could these be calculated using TxDb
335 | packages? What challenges are associated with calculating these
336 | 'weights', based on the knowledge that genes typically consist of
337 | several transcripts, each expressed differently?)
338 |
339 | ```{r prostate-edgeR-pwf}
340 | pwf <- nullp(genes,"hg19","ensGene")
341 | head(pwf)
342 | ```
343 |
344 | ### Over- and under-representation
345 |
346 | Perform the main analysis. This includes associating genes with GO
347 | pathways.
348 |
349 | ```{r prostate-goseq-wall}
350 | GO.wall <- goseq(pwf, "hg19", "ensGene")
351 | head(GO.wall)
352 | ```
353 |
354 | ### What if we'd ignored gene length?
355 |
356 | Here we do the same operation, but ignore gene lengths
357 |
358 | ```{r prostate-goseq-nobias}
359 | GO.nobias <- goseq(pwf,"hg19","ensGene",method="Hypergeometric")
360 | ```
361 |
362 | Compare the over-represented P-values for each set, under the
363 | different methods
364 |
365 | ```{r prostate-goseq-compare, fig.width=5, fig.height=5}
366 | idx <- match(GO.nobias$category, GO.wall$category)
367 | plot(log10(GO.nobias[, "over_represented_pvalue"]) ~
368 | log10(GO.wall[idx, "over_represented_pvalue"]),
369 | xlab="Wallenius", ylab="Hypergeometric",
370 | xlim=c(-5, 0), ylim=c(-5, 0))
371 | abline(0, 1, col="red", lwd=2)
372 | ```
373 |
374 | # References
375 |
376 | - Khatri et al., 2012, PLoS Comp Biol 8.2: e1002375.
377 | - Subramanian et al., 2005, PNAS 102.43: 15545-15550.
378 | - Jiang & Gentleman, 2007, Bioinformatics 23.3: 306-313.
379 | - Goeman & Bühlmann, 2007, Bioinformatics 23.8: 980-987.
380 | - Hummel et al., 2008, Bioinformatics 24.1: 78-85.
381 | - Wu & Smyth, 2012, Nucleic Acids Research 40: e133.
382 | - Wu et al., 2010, Bioinformatics 26: 2176-2182.
383 | - Majewski et al., 2010, Blood, published online 5 May 2010.
384 | - Tarca et al., 2009, Bioinformatics 25.1: 75-82.
385 | - Young et al., 2010, Genome Biology 11:R14.
386 |
--------------------------------------------------------------------------------
/vignettes/F_ChIPSeq.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: F. ChIP-seq
3 | author: Martin Morgan (mtmorgan@fredhutch.org)
4 | date: "`r Sys.Date()`"
5 | output:
6 | BiocStyle::html_document:
7 | markdown_strict: true
8 | toc: true
9 | vignette: >
10 | %\VignetteIndexEntry{F. ChIP-seq}
11 | %\VignetteEngine{knitr::rmarkdown}
12 | \usepackage[utf8]{inputenc}
13 | ---
14 |
15 | ```{r style, echo=FALSE, results='asis'}
16 | BiocStyle::markdown()
17 | ```
18 |
19 | ```{r setup, echo=FALSE}
20 | options(digits=3)
21 | suppressPackageStartupMessages({
22 | library(csaw)
23 | library(edgeR)
24 | library(GenomicRanges)
25 | library(ChIPseeker)
26 | library(genefilter)
27 | })
28 | ```
29 |
30 | # Motivation & work flow
31 |
32 | Key references
33 |
34 | - Kharchenko, Tolstorukov, and Park
35 | ([2008](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2597701/)).
36 | - Lun and Smyth ([2014](https://doi.org/10.1093/nar/gku351)).
37 |
38 | ## ChIP-seq
39 |
40 | Kharchenko et
41 | al. ([2008](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2597701/)).
42 | 
43 |
44 | - Tags versus sequenced reads; single-end read extension in 3'
45 | direction
46 | - Strand shift / cross-correlation
47 | - Defined (narrow, e.g., transcription factor binding sites) versus
48 | diffuse (e.g., histone marks) peaks
49 |
50 | ChIP-seq for differential binding
51 |
52 | - Designed experiment with replicate samples per treatment
53 | - Analysis using insights from microarrays / RNA-seq
54 |
55 | Novel statistical issues
56 |
57 | - Inferring peaks without 'data snooping' (using the same data twice,
58 | once to infer peaks, once to estimate differential binding)
59 | - Retaining power
60 | - Minimizing false discovery rate
61 |
62 | ## Work flow
63 |
64 | - Following Bailey et al.,
65 | [2013](https://doi.org/10.1371/journal.pcbi.1003326)
66 |
67 | 
68 |
69 | Experimental design and execution
70 |
71 | - Single sample
72 |
73 | - ChIPed transcription factor and ...
74 | - Input (fragmented genomic DNA) or control (e.g., IP with
75 | non-specific antibody such as immunoglobulin G, IgG)
76 |
77 | - Designed experiments
78 |
79 | - Replication of TF / control pairs
80 |
81 | Sequencing & alignment
82 |
83 | - Sequencing depth rules of thumb: $>10M$ reads for narrow peaks,
84 | $>20M$ for broad peaks
85 | - Long & paired end useful but not essential -- alignment in ambiguous
86 | regions
87 | - Basic aligners generally adequate, e.g., no need to align splice
88 | junctions
89 | - Sims et al., [2014](https://doi.org/10.1038/nrg3642)
90 |
91 | Peak calling
92 |
93 | - Very large number of peak calling programs; some specialized for
94 | e.g., narrow vs. broad peaks.
95 | - Commonly used: [MACS](http://liulab.dfci.harvard.edu/MACS/),
96 | PeakSeq, CisGenome, ...
97 | - MACS: Model-based Analysis of ChIP-Seq, Zhang et al.,
98 | [2008](https://doi.org/10.1186/gb-2008-9-9-r137)
99 |
100 | - Scale control tag counts to match ChIP counts
101 | - Center peaks by shifting $d/2$
102 | - Model occurrence of a tag as a Poisson process
103 | - Look for fixed-width sliding windows with an excess number of
104 | tags (enrichment)
105 | - Empirical FDR: Swap ChIP and control samples; FDR is \# control
106 | peaks / \# ChIP peaks
107 | - Output: BED file of called peaks
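
The scaling and Poisson steps above can be sketched in base _R_ on simulated window counts. This is not the MACS implementation: the counts and the true peak positions are invented, taking the larger of the local and genome-wide background only loosely mirrors what a peak caller does, and a Benjamini-Hochberg adjustment is swapped in for the sample-swap empirical FDR.

```r
set.seed(7)
n_win <- 5000

## Simulated tag counts per window; the first 50 windows are true peaks
control <- rpois(n_win, 4)
chip <- rpois(n_win, 2)
peak <- 1:50
chip[peak] <- rpois(length(peak), 20)

## Scale control tag counts to match ChIP counts
scale <- sum(chip) / sum(control)

## Background rate per window: local (scaled control) or genome-wide,
## whichever is larger
lambda <- pmax(control * scale, mean(control * scale))

## Poisson probability of at least the observed number of ChIP tags
p <- ppois(chip - 1, lambda, lower.tail = FALSE)
called <- which(p.adjust(p, method = "BH") < 0.05)

table(true_peak = seq_len(n_win) %in% peak,
      called = seq_len(n_win) %in% called)
```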
108 |
109 | Down-stream analysis
110 |
111 | - Annotation: what genes are my peaks near?
112 | - Differential representation: which peaks are over- or
113 | under-represented in treatment 1, compared to treatment 2?
114 | - Motif identification (peaks over known motifs?) and discovery
115 | - Integrative analysis, e.g., association of regulatory elements and
116 | expression
117 |
118 | ## Peak calling
119 |
120 | 'Known' ranges
121 |
122 | - Count tags in pre-defined ranges, e.g., promoter regions of known
123 | genes
124 | - Obvious limitations, e.g., regulatory elements not in specified
125 | ranges; specified range contains multiple regulatory elements with
126 | complementary behavior
127 |
128 | _de novo_ windows
129 |
130 | - Width: narrow peaks, 1bp; broad peaks, 150bp
131 | - Offset: 25-100bp; influencing computational burden
132 |
133 | _de novo_ peak calling
134 |
135 | - Third-party software (many available;
136 | [MACS](http://liulab.dfci.harvard.edu/MACS/) commonly used)
137 | - Various strategies for calling peaks -- Lun & Smyth,
138 | [Table 1](http://nar.oxfordjournals.org/content/42/11/e95/T1.expansion.html)
139 |
140 | - Call each sample independently; intersection or union of peaks
141 | across samples, ...
142 | - Call peaks from a pooled library
143 | - ...
144 |
145 | - Relevant slides [pdf](http://bioconductor.org/help/course-materials/2014/CSAMA2014/4_Thursday/lectures/ChIPSeq_slides.pdf)
146 |
147 | ## Peak calling across libraries
148 |
149 | - Table 1: Description of peak calling strategies. Each
150 | strategy is given an identifier and is described by the mode in
151 | which MACS is run, the libraries on which it is run and the
152 | consolidation operation (if any) performed to combine peaks between
153 | libraries or groups. For method 6, the union of the peaks in each
154 | direction of enrichment is taken.
155 |
156 | | ID | Mode          | Library           | Operation    |
157 | |----|---------------|-------------------|--------------|
158 | | 1  | Single-sample | Individual        | Union        |
159 | | 2  | Single-sample | Individual        | Intersection |
160 | | 3  | Single-sample | Individual        | At least 2   |
161 | | 4  | Single-sample | Pooled over group | Union        |
162 | | 5  | Single-sample | Pooled over group | Intersection |
163 | | 6  | Two-sample    | Pooled over group | Union        |
164 | | 7  | Single-sample | Pooled over all   | –            |
165 |
211 | - How to choose? -- Lun & Smyth,
212 |
213 | - Under the null hypothesis, _P_ values are uniformly distributed, so the observed type I error rate should match the specified threshold
214 | - [Table 2](http://nar.oxfordjournals.org/content/42/11/e95/T2.expansion.html):
215 | consequences for type I error
216 | - Best strategy: call peaks from a pooled library
217 | - Table 2: The observed type I error rate when testing
218 | for differential enrichment using counts from each peak calling
219 | strategy. Error rates for a range of specified error thresholds
220 | are shown. All values represent the mean of 10 simulation
221 | iterations with the standard error shown in brackets. RA:
222 | reference analysis using 10 000 randomly chosen true peaks.
223 |
224 | | ID | Error rate 0.01 | Error rate 0.05 | Error rate 0.1 |
225 | |----|-----------------|-----------------|----------------|
226 | | RA | 0.010 (0.000)   | 0.051 (0.001)   | 0.100 (0.002)  |
227 | | 1  | 0.002 (0.000)   | 0.019 (0.001)   | 0.053 (0.001)  |
228 | | 2  | 0.003 (0.000)   | 0.030 (0.000)   | 0.073 (0.001)  |
229 | | 3  | 0.006 (0.000)   | 0.042 (0.001)   | 0.092 (0.001)  |
230 | | 4  | 0.033 (0.001)   | 0.145 (0.001)   | 0.261 (0.002)  |
231 | | 5  | 0.000 (0.000)   | 0.001 (0.000)   | 0.005 (0.000)  |
232 | | 6  | 0.088 (0.006)   | 0.528 (0.013)   | 0.893 (0.006)  |
233 | | 7  | 0.010 (0.000)   | 0.049 (0.001)   | 0.098 (0.001)  |
234 |
289 |
290 | ```{r null-p, cache=TRUE}
291 | ## 100,000 t-tests under the null, n = 6
292 | n <- 6; m <- matrix(rnorm(n * 100000), ncol=n)
293 | P <- genefilter::rowttests(m, factor(rep(1:2, each=3)))$p.value
294 | quantile(P, c(.001, .01, .05))
295 | hist(P, breaks=20)
296 | ```
297 |
298 | _de novo_ hybrid strategies
299 |
300 | # Practical: Differential binding (`r Biocpkg("csaw")`)
301 |
302 | This exercise is based on the `r Biocpkg("csaw")` vignette, where more
303 | detail can be found.
304 |
305 | ## 1 - 4: Experimental Design ... Alignment
306 |
307 | The experiment involves changes in binding profiles of the NFYA
308 | protein between embryonic stem cells and terminal neurons. It is a
309 | subset of the data provided by Tiwari et
310 | al. [2012](https://doi.org/10.1038/ng.1036) available as
311 | [GSE25532](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE25532). There
312 | are two es (embryonic stem cell) and two tn (terminal neuron)
313 | replicates. Single-end FASTQ files were extracted from GEO, aligned
314 | using `r Biocpkg("Rsubread")`, and post-processed (sorted and indexed)
315 | using `r Biocpkg("Rsamtools")` with the script available at
316 |
317 | ```{r csaw-preprocess, eval=FALSE}
318 | system.file(package="UseBioconductor", "script", "ChIPSeq", "NFYA",
319 | "preprocess_NFYA.R")
320 | ```
321 |
322 | Create a data frame summarizing the files used.
323 |
324 | ```{r csaw-setup}
325 | files <- local({
326 | acc <- c(es_1="SRR074398", es_2="SRR074399", tn_1="SRR074417",
327 | tn_2="SRR074418")
328 | data.frame(Treatment=sub("_.*", "", names(acc)),
329 | Replicate=sub(".*_", "", names(acc)),
330 | sra=sprintf("%s.sra", acc),
331 | fastq=sprintf("%s.fastq.gz", acc),
332 | bam=sprintf("%s.fastq.gz.subread.BAM", acc),
333 | row.names=acc, stringsAsFactors=FALSE)
334 | })
335 | ```
336 |
337 | ## 5: Reduction
338 |
339 | Change to the directory where the BAM files are located
340 |
341 | ```{r csaw-setwd, eval=FALSE}
342 | setwd("~/UseBioconductor-data/ChIPSeq/NFYA")
343 | ```
344 |
345 | Load the csaw library and count reads in overlapping windows. This
346 | returns a `SummarizedExperiment`, so explore it a bit...
347 |
348 | ```{r csaw-reduction, eval=FALSE}
349 | library(csaw)
350 | library(GenomicRanges)
351 | frag.len <- 110
352 | system.time({
353 | data <- windowCounts(files$bam, width=10, ext=frag.len)
354 | }) # 156 seconds
355 | acc <- sub(".fastq.*", "", data$bam.files)
356 | colData(data) <- cbind(files[acc,], colData(data))
357 | ```
358 |
359 | ## 6: Analysis
360 |
361 | **Filtering** (vignette Chapter 3) Start by filtering low-count
362 | windows. There are likely to be many of these (how many?). Is there a
363 | rational way to choose the filtering threshold?
364 |
365 | ```{r csaw-filter, eval=FALSE}
366 | library(edgeR) # for aveLogCPM()
367 | keep <- aveLogCPM(assay(data)) >= -1
368 | data <- data[keep,]
369 | ```
370 |
371 | ```{r csaw-data-load, echo=FALSE}
372 | frag.len <- 110
373 | fl <- system.file(package="UseBioconductor", "extdata", "csaw-data-filtered.Rds")
374 | data <- readRDS(fl)
375 | ```
376 |
377 | **Normalization (composition bias)** (vignette Chapter 4) csaw uses
378 | binned counts in normalization. The bins are large relative to the
379 | ChIP peaks, on the assumption that the bins primarily represent
380 | non-differentially bound regions. The sample bin counts are normalized
381 | using the `r Biocpkg("edgeR")` TMM (trimmed mean of M values) method
382 | seen in the RNASeq differential expression lab. Explore vignette
383 | chapter 4 for more on normalization (this is a useful resource when
384 | seeking to develop normalization methods for other protocols!).
385 |
386 | ```{r csaw-normalize, eval=FALSE}
387 | system.time({
388 | binned <- windowCounts(files$bam, bin=TRUE, width=10000)
389 | }) # 139 seconds
390 | normfacs <- normalize(binned)
391 | ```
392 | ```{r csaw-normacs-load, echo=FALSE}
393 | fl <- system.file(package="UseBioconductor", "extdata", "csaw-normfacs.Rds")
394 | normfacs <- readRDS(fl)
395 | ```
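
The idea behind TMM normalization of the bin counts can be sketched in base _R_. This is a simplified, unweighted version on simulated counts, not the `r Biocpkg("edgeR")` or `r Biocpkg("csaw")` implementation (which, among other things, weights the M values and also trims on abundance). The bin counts, the 5% of strongly bound bins, and the 30% trim are assumptions for illustration.

```r
set.seed(11)
n_bin <- 2000

## Simulated background abundance per bin, shared by both libraries
base <- rgamma(n_bin, shape = 5, rate = 0.01)
ref <- rpois(n_bin, base)
smp <- rpois(n_bin, base)

## 5% of bins gain strong binding in 'smp', inflating its library size
bound <- 1:100
smp[bound] <- rpois(length(bound), base[bound] * 10)

## M values: log ratio of library-size-scaled counts
keep <- ref > 0 & smp > 0
M <- log2((smp[keep] / sum(smp)) / (ref[keep] / sum(ref)))

## Trim 30% from each tail, then average the remaining M values
lim <- quantile(M, c(0.3, 0.7))
tmm <- 2^mean(M[M >= lim[1] & M <= lim[2]])

## Effective library size ratio after applying the factor; close to 1
## because most bins are not differentially bound
eff_ratio <- tmm * sum(smp) / sum(ref)
c(raw_ratio = sum(smp) / sum(ref), tmm = tmm, eff_ratio = eff_ratio)
```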
396 |
397 | **Experimental design and Differential binding** (vignette Chapter 5)
398 | Differential binding will be assessed using `r Biocpkg("edgeR")`,
399 | where we need to specify the experimental design
400 |
401 | ```{r csaw-experimental-design}
402 | design <- model.matrix(~Treatment, colData(data))
403 | ```
404 |
405 | Apply a standard `r Biocpkg("edgeR")` work flow to identify
406 | differentially bound regions. Creatively explore the results.
407 |
408 | ```{r csaw-de}
409 | y <- asDGEList(data, norm.factors=normfacs)
410 | y <- estimateDisp(y, design)
411 | fit <- glmQLFit(y, design, robust=TRUE)
412 | results <- glmQLFTest(fit, contrast=c(0, 1))
413 | head(results$table)
414 | ```
415 |
416 | **Multiple testing** (vignette Chapter 6) The challenge is that FDR
417 | across all detected differentially bound _regions_ is what one is
418 | interested in, but what is immediately available is the FDR across
419 | differentially bound _windows_; a region will often consist of multiple
420 | overlapping windows. As a first step, we'll take a 'quick and dirty'
421 | approach to identifying regions by merging 'high-abundance' windows
422 | that are within, e.g., 1kb of one another.
423 |
424 | ```{r csaw-merge-windows}
425 | merged <- mergeWindows(rowRanges(data), tol=1000L)
426 | ```
427 |
428 | Combine test results across windows within regions. Several strategies
429 | are explored in section 6.5 of the vignette.
430 |
431 | ```{r csaw-combine-merged-tests}
432 | tabcom <- combineTests(merged$id, results$table)
433 | head(tabcom)
434 | ```
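
To our understanding, `combineTests()` combines window-level _P_ values within a region using Simes' method. A base _R_ sketch of that calculation follows, with made-up _P_ values and region identifiers; the helper name `simes()` is hypothetical, and the csaw function also reports additional columns not reproduced here.

```r
## Simes' combined P value for one region: the smallest of
## n * p_(i) / i over the sorted window P values
simes <- function(p) {
    p <- sort(p)
    min(length(p) * p / seq_along(p))
}

## Hypothetical window-level P values and the region each belongs to
id <- c(1, 1, 1, 2, 2, 3)
p <- c(0.01, 0.04, 0.03, 0.2, 0.5, 0.001)

region_p <- tapply(p, id, simes)
region_p

## FDR control is then applied across regions, not windows
p.adjust(region_p, method = "BH")
```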
435 |
436 | Section 6.6 of the vignette discusses approaches to identifying the
437 | 'best' windows within regions.
438 |
439 | Finally, create a `GRangesList` that associates the two result tables
440 | with the genomic ranges over which the results were calculated.
441 |
442 | ```{r csaw-grangeslist}
443 | gr <- rowRanges(data)
444 | mcols(gr) <- as(results$table, "DataFrame")
445 | grl <- split(gr, merged$id)
446 | mcols(grl) <- as(tabcom, "DataFrame")
447 | ```
448 |
449 | ## Annotation
450 |
451 | ### csaw
452 |
453 | ### ChIPseeker
454 |
--------------------------------------------------------------------------------
/vignettes/GSEA/2012-07-06-Gentleman-GSEA.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/2012-07-06-Gentleman-GSEA.pdf
--------------------------------------------------------------------------------
/vignettes/GSEA/Category-ribosome.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/Category-ribosome.png
--------------------------------------------------------------------------------
/vignettes/GSEA/GSEA2011.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/GSEA2011.pdf
--------------------------------------------------------------------------------
/vignettes/GSEA/GSEAlm.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/GSEAlm.pdf
--------------------------------------------------------------------------------
/vignettes/GSEA/L8_Gene_Set_Testing.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/L8_Gene_Set_Testing.pdf
--------------------------------------------------------------------------------
/vignettes/GSEA/SPIA.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/SPIA.pdf
--------------------------------------------------------------------------------
/vignettes/GSEA/SPIA.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/SPIA.png
--------------------------------------------------------------------------------
/vignettes/GSEA/goseq.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/goseq.pdf
--------------------------------------------------------------------------------
/vignettes/GSEA/subramanian-F1-part.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/subramanian-F1-part.jpg
--------------------------------------------------------------------------------
/vignettes/GSEA/subramanian-F1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/subramanian-F1.jpg
--------------------------------------------------------------------------------
/vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2-cropped.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2-cropped.pdf
--------------------------------------------------------------------------------
/vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2-cropped.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2-cropped.png
--------------------------------------------------------------------------------
/vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2-cropped.tiff:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2-cropped.tiff
--------------------------------------------------------------------------------
/vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2.pdf
--------------------------------------------------------------------------------
/vignettes/I_LargeData.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: I. Working with Large Data
3 | author: Martin Morgan (mtmorgan@fredhutch.org)
4 | date: "`r Sys.Date()`"
5 | output:
6 | BiocStyle::html_document:
7 | toc: true
8 | vignette: >
9 | %\VignetteIndexEntry{I. Working with Large Data}
10 | %\VignetteEngine{knitr::rmarkdown}
11 | \usepackage[utf8]{inputenc}
12 | ---
13 |
14 | ```{r style, echo=FALSE, results='asis'}
15 | BiocStyle::markdown()
16 | suppressPackageStartupMessages({
17 | library(rtracklayer)
18 | library(BiocParallel)
19 | library(GenomicFiles)
20 | library(TxDb.Hsapiens.UCSC.hg19.knownGene)
21 | })
22 | ```
23 |
24 | # Scalable computing
25 |
26 | Efficient _R_ code
27 |
28 | - Vectorize!
29 | - Reuse others' work -- `r Biocpkg("DESeq2")`,
30 | `r Biocpkg("GenomicRanges")`, `r Biocpkg("Biostrings")`, ...,
31 | `r CRANpkg("dplyr")`, `r CRANpkg("data.table")`, `r CRANpkg("Rcpp")`
32 | - Useful tools: `system.time()`, `Rprof()`, `r CRANpkg("microbenchmark")`
33 | - More detail in
34 | [deadly sins](http://bioconductor.org/help/course-materials/2014/CSAMA2014/1_Monday/labs/IntermediateR.html#efficient-code)
35 | of a previous course.
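
A small base _R_ illustration of 'Vectorize!', using `system.time()` from the tools listed above. The function names are made up, and timings will vary by machine; the point is that both versions return the same answer while the vectorized one does the iteration in compiled code.

```r
x <- runif(1e5)

## Explicit iteration, one element at a time
f_loop <- function(x) {
    res <- numeric(length(x))
    for (i in seq_along(x))
        res[i] <- x[i]^2
    res
}

## Vectorized: a single call operating on the whole vector
f_vec <- function(x) x^2

all.equal(f_loop(x), f_vec(x))
system.time(f_loop(x))
system.time(f_vec(x))
```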
36 |
37 | Iteration
38 |
39 | - Chunk-wise
40 | - `open()`, read chunk(s), `close()`.
41 | - e.g., `yieldSize` argument to `Rsamtools::BamFile()`
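
The open / read chunk / close pattern can be sketched with a plain text file in base _R_. The temporary file and the chunk size of 4 lines are stand-ins; with BAM files the same loop shape applies, with the chunk size set by `yieldSize`.

```r
## A small file standing in for 'large' data
fl <- tempfile()
writeLines(as.character(1:10), fl)

con <- file(fl, open = "r")
total <- 0
n_chunks <- 0

## Read 4 lines at a time until nothing is left; only one chunk and
## the running summary are ever held in memory
while (length(chunk <- readLines(con, n = 4)) > 0) {
    total <- total + sum(as.numeric(chunk))
    n_chunks <- n_chunks + 1
}
close(con)
unlink(fl)

c(total = total, n_chunks = n_chunks)
```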
42 |
43 | Restriction
44 |
45 | - Limit to columns and / or rows of interest
46 | - Exploit domain-specific formats, e.g., BAM files and
47 | `Rsamtools::ScanBamParam()`
48 | - Use a data base
49 |
50 | Sampling
51 |
52 | - Iterate through large data, retaining a manageable sample, e.g.,
53 | `ShortRead::FastqSampler()`
54 |
55 | Parallel evaluation
56 |
57 | - **After** writing efficient code
58 | - Typically, `lapply()`-like operations
59 | - Cores on a single machine ('easy'); clusters (more tedious);
60 | clouds
61 |
62 | # File management
63 |
64 | ## File classes
65 |
66 | | Type | Example use | Name | Package |
67 | |-------|-----------------------|-----------------------------|----------------------------------|
68 | | .bed | Range annotations | `BedFile()` | `r Biocpkg("rtracklayer")` |
69 | | .wig | Coverage | `WigFile()`, `BigWigFile()` | `r Biocpkg("rtracklayer")` |
70 | | .gtf | Transcript models | `GTFFile()` | `r Biocpkg("rtracklayer")` |
71 | | | | `makeTxDbFromGFF()` | `r Biocpkg("GenomicFeatures")` |
72 | | .2bit | Genomic Sequence | `TwoBitFile()` | `r Biocpkg("rtracklayer")` |
73 | | .fastq | Reads & qualities | `FastqFile()` | `r Biocpkg("ShortRead")` |
74 | | .bam | Aligned reads | `BamFile()` | `r Biocpkg("Rsamtools")` |
75 | | .tbx | Indexed tab-delimited | `TabixFile()` | `r Biocpkg("Rsamtools")` |
76 | | .vcf | Variant calls | `VcfFile()` | `r Biocpkg("VariantAnnotation")` |
77 |
78 | ```{r rtracklayer-file-classes}
79 | ## rtracklayer menagerie
80 | library(rtracklayer)
81 | names(getClass("RTLFile")@subclasses)
82 | ```
83 |
84 | Notes
85 |
86 | - Not a consistent interface
87 | - `open()`, `close()`, `import()` / `yield()` / `read*()`
88 | - Some: selective import via index (e.g., `.bai`, bam index);
89 | selection ('columns'); restriction ('rows')
90 |
91 | ## Managing a collection of files
92 |
93 | `*FileList()` classes
94 |
95 | - `reduceByYield()` -- iterate through a single large file
96 | - `bplapply()` (`r Biocpkg("BiocParallel")`) -- perform independent
97 | operations on several files, in parallel
98 |
99 | `GenomicFiles()`
100 |
101 | - 'rows' as genomic range restrictions, 'columns' as files
102 | - Each row x column is a _map_ from file data to useful representation
103 | in _R_
104 | - `reduceByRange()`, `reduceByFile()`: collapse maps into summary
105 | representation
106 | - see the GenomicFiles vignette
107 | [Figure 1](http://bioconductor.org/packages/devel/bioc/vignettes/GenomicFiles/inst/doc/GenomicFiles.pdf)
108 |
109 | # Parallel evaluation with BiocParallel
110 |
111 | Standardized interface for simple parallel evaluation.
112 |
113 | - `bplapply()` instead of `lapply()`
114 | - Argument `BPPARAM` influences how parallel evaluation occurs
115 |
116 | - `MulticoreParam()`: forked processes on a single (non-Windows) machine
117 | - `SnowParam()`: processes on the same or different machines
118 | - `BatchJobsParam()`: resource scheduler on a cluster
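
A minimal sketch of how `BPPARAM` selects the back-end while the call itself stays the same. The function `square()` and the choice of two workers are assumptions made for illustration.

```r
library(BiocParallel)

square <- function(i) i^2

## Same call, different back-ends
serial <- bplapply(1:4, square, BPPARAM = SerialParam())
snow   <- bplapply(1:4, square, BPPARAM = SnowParam(workers = 2))

## Both return a list with one element per input
identical(serial, snow)
```

Because only the `BPPARAM` argument changes, code developed serially can be moved to several cores or a cluster without rewriting the loop.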
119 |
120 | Other resources
121 |
122 | - [Bioconductor Amazon AMI](http://bioconductor.org/help/bioconductor-cloud-ami/)
123 |
124 | - Easily 'spin up' tens of instances
125 | - Pre-configured with Bioconductor packages and StarCluster
126 | management
127 |
128 | - `r Biocpkg("GoogleGenomics")` to interact with Google's compute cloud
129 | and resources
130 |
131 |
132 | # Practical
133 |
134 | ## Efficient code
135 |
136 | Define the following function.
137 |
138 | ```{r benchmark-f0}
139 | f0 <- function(n) {
140 | ## inefficient!
141 | ans <- numeric()
142 | for (i in seq_len(n))
143 | ans <- c(ans, exp(i))
144 | ans
145 | }
146 | ```
147 |
148 | Use `system.time()` to explore how execution time grows as `n`
149 | increases from 100 to 10000. Then use `identical()` and
150 | `r CRANpkg("microbenchmark")` to compare the alternatives `f1()`, `f2()`,
151 | and `f3()` with `f0()`, checking both correctness and performance.
152 | What strategies are these functions using?
153 |
154 | ```{r benchmark}
155 | f1 <- function(n) {
156 | ans <- numeric(n)
157 | for (i in seq_len(n))
158 | ans[[i]] <- exp(i)
159 | ans
160 | }
161 |
162 | f2 <- function(n)
163 | sapply(seq_len(n), exp)
164 |
165 | f3 <- function(n)
166 | exp(seq_len(n))
167 | ```
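
One possible way to set up the comparison, as a sketch rather than a model answer. `f0()` and `f3()` are repeated so the block runs on its own; the `sizes` vector and `times = 10` are arbitrary choices, and the `microbenchmark` step only runs if that package is installed.

```r
## Repeated from above so that this block stands alone
f0 <- function(n) {
    ans <- numeric()
    for (i in seq_len(n))
        ans <- c(ans, exp(i))
    ans
}
f3 <- function(n)
    exp(seq_len(n))

## Correctness: do the two implementations agree exactly?
n <- 1000
identical(f0(n), f3(n))

## Performance: elapsed time of f0() as n grows
sizes <- c(100, 1000, 10000)
elapsed <- sapply(sizes, function(n) system.time(f0(n))[["elapsed"]])
setNames(elapsed, sizes)

## Finer-grained comparison, if microbenchmark is available
if (requireNamespace("microbenchmark", quietly = TRUE))
    print(microbenchmark::microbenchmark(f0(n), f3(n), times = 10))
```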
168 |
169 | ## Sleeping serially and in parallel
170 |
171 | `fun()` sleeps for 1 second, then returns `i`. Called serially 8 times (e.g., `f0(8)`), this takes about 8 seconds.
172 |
173 | ```{r parallel-sleep}
174 | library(BiocParallel)
175 |
176 | fun <- function(i) {
177 | Sys.sleep(1)
178 | i
179 | }
180 |
181 | ## serial
182 | f0 <- function(n)
183 | lapply(seq_len(n), fun)
184 |
185 | ## parallel
186 | f1 <- function(n)
187 | bplapply(seq_len(n), fun)
188 | ```
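
A sketch of how the two versions might be timed. It assumes `n = 4` and a `SnowParam()` with two workers so that it runs on any platform; starting the workers adds some overhead, so the parallel time will not be exactly half the serial time.

```r
library(BiocParallel)

fun <- function(i) {
    Sys.sleep(1)
    i
}

n <- 4

## serial: roughly n seconds
t_serial <- system.time(
    res0 <- lapply(seq_len(n), fun)
)[["elapsed"]]

## parallel: roughly n / workers seconds, plus start-up overhead
t_parallel <- system.time(
    res1 <- bplapply(seq_len(n), fun, BPPARAM = SnowParam(workers = 2))
)[["elapsed"]]

identical(res0, res1)
c(serial = t_serial, parallel = t_parallel)
```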
189 |
190 | ## Reads overlapping windows
191 |
192 | This exercise uses the following packages:
193 |
194 | ```{r csaw-packages}
195 | library(GenomicAlignments)
196 | library(GenomicFiles)
197 | library(BiocParallel)
198 | library(Rsamtools)
199 | library(GenomeInfoDb)
200 | ```
201 |
202 | This is a re-implementation of the basic `r Biocpkg("csaw")` binned
203 | counts algorithm. It supposes that ChIP fragment lengths are 110 nt,
204 | and that we bin coverage in windows of width 50. We focus on chr1.
205 |
206 | ```{r olaps-chr}
207 | frag.len <- 110
208 | spacing <- 50
209 | chr <- "chr1"
210 | ```
211 |
212 | Here we point to the BAM files, indicating that we'll process each
213 | file in chunks of 1,000,000 records.
214 |
215 | ```{r olaps-tileGenome}
216 | fls <- dir("~/UseBioconductor-data/ChIPSeq/NFYA/", "\\.BAM$", full.names=TRUE)
217 | names(fls) <- sub("\\.fastq.*", "", basename(fls))
218 | bfl <- BamFileList(fls, yieldSize=1000000)
219 | ```
220 |
221 | We'll create the counting bins using `tileGenome()`, focusing on the
222 | 'standard' chromosomes (here, just `chr1`).
223 |
224 | ```{r csaw-tiles}
225 | len <- seqlengths(keepStandardChromosomes(seqinfo(bfl)))[chr]
226 | tiles <- tileGenome(len, tilewidth=spacing, cut.last.tile.in.chrom=TRUE)
227 | ```
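
To see what `tileGenome()` produces, here is a toy sketch on a hypothetical 230-bp chromosome named `chrA` (both the name and the length are made up for illustration).

```r
library(GenomicRanges)

## Hypothetical 230-bp chromosome, tiled in 50-bp windows
toy_tiles <- tileGenome(c(chrA = 230), tilewidth = 50,
                        cut.last.tile.in.chrom = TRUE)

length(toy_tiles)   # number of tiles
width(toy_tiles)    # all tiles have width 50 except the last, which is cut
```

With `cut.last.tile.in.chrom=TRUE` the result is a `GRanges` of adjacent tiles, which is the form needed by `countOverlaps()`.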
228 |
229 | We'll use `reduceByYield()` to iterate through a single file. We need
230 | to tell this function how we'll `YIELD` a chunk of the file, how we'll
231 | `MAP` the chunk from its input representation to the per-window
232 | counts, and finally how we'll `REDUCE` successive chunks into a final
233 | representation.
234 |
235 | `YIELD` is supposed to be a function that takes one argument, the
236 | input source, and returns a chunk of records.
237 |
238 | ```{r yield}
239 | yield <- function(x, ...)
240 | readGAlignments(x)
241 | ```
242 |
243 | `MAP` must take the output of `YIELD` and perhaps additional arguments,
244 | and return a vector of counts. We'll resize the genomic ranges
245 | describing the alignments so that they have a width equal to the
246 | fragment length, then count the ranges overlapping each tile.
247 |
248 | ```{r map}
249 | map <- function(x, tiles, frag.len, ...) {
250 | gr <- keepStandardChromosomes(granges(x))
251 | countOverlaps(tiles, resize(gr, frag.len))
252 | }
253 | ```
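
The resize-then-count step can be checked on toy data. This sketch assumes a hypothetical 300-bp `chr1` and two made-up 36-nt reads, one on each strand; it is meant to show that `resize()` is strand-aware, not to reproduce the `csaw` implementation.

```r
library(GenomicRanges)

tiles <- tileGenome(c(chr1 = 300), tilewidth = 50,
                    cut.last.tile.in.chrom = TRUE)

## Two hypothetical 36-nt reads, one per strand
reads <- GRanges("chr1", IRanges(start = c(101, 215), width = 36),
                 strand = c("+", "-"))

## "+" reads are extended rightwards from their start,
## "-" reads are extended leftwards from their end
frags <- resize(reads, 110)
ranges(frags)

## Number of fragments overlapping each 50-bp tile
counts <- countOverlaps(tiles, frags)
counts
```

The "+" read becomes 101-210 and the "-" read becomes 141-250, so tiles 3 to 5 are each overlapped by both fragments.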
254 |
255 | `REDUCE` takes two results from `MAP` (in our case, vectors of counts)
256 | and combines them into a single result. We simply add our vectors (`+`
257 | is actually a function!).
258 |
259 | ```{r reduce}
260 | reduce <- `+`
261 | ```
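
A small base _R_ sketch of what combining per-chunk count vectors with `+` amounts to. The three `chunks` vectors are made-up counts for three windows.

```r
## `+` can be called like any other function
`+`(1, 2)

## Hypothetical per-chunk counts for three windows
chunks <- list(c(1, 0, 2), c(0, 3, 1), c(4, 0, 0))

## Fold the chunks together pairwise, left to right
total <- Reduce(`+`, chunks)
total
```

Because addition is element-wise, each position of `total` is the count for one window summed over all chunks.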
262 |
263 | To process one file, we use `reduceByYield()`, passing the file we
264 | want to process and the yield, map, and reduce functions. Our 'wrapper'
265 | function passes any additional arguments through to `reduceByYield()`
266 | using `...`:
267 |
268 | ```{r reduceByYield}
269 | count1file <- function(bf, ...)
270 | reduceByYield(bf, yield, map, reduce, ...)
271 | ```
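
The same pattern can be tried on a small file. This sketch assumes the example BAM file shipped with `Rsamtools` and a deliberately small `yieldSize`; the map step simply counts the records in each chunk, so the reduced value should equal the number of alignments read in one pass.

```r
library(GenomicFiles)
library(GenomicAlignments)

fl <- system.file("extdata", "ex1.bam", package = "Rsamtools")
bf <- BamFile(fl, yieldSize = 1000)   # small chunks, for illustration only

total <- reduceByYield(
    bf,
    YIELD = function(x, ...) readGAlignments(x),  # next chunk of records
    MAP = length,                                 # records in this chunk
    REDUCE = `+`                                  # running total
)

## Compare with reading the whole file at once
total == length(readGAlignments(fl))
```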
272 |
273 | Using `yieldSize` and `reduceByYield()` means that we do not consume
274 | too much memory processing each file, so that we can process files in
275 | parallel using `r Biocpkg("BiocParallel")`. The `simplify2array()`
276 | function transforms a list-of-vectors to a matrix.
277 |
278 | ```{r count-overlaps-parallel, eval=FALSE}
279 | counts <- bplapply(bfl, count1file, tiles=tiles, frag.len=frag.len)
280 | counts <- simplify2array(counts)
281 | dim(counts)
282 | colSums(counts)
283 | ```
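
What `simplify2array()` does to the per-file results can be seen on a toy list. The sample names and counts below are made up.

```r
## Hypothetical per-file count vectors, one element per sample
per_file <- list(sampleA = c(1, 0, 2), sampleB = c(0, 3, 1))

m <- simplify2array(per_file)
dim(m)       # windows x samples
colSums(m)   # total count per sample
```

Each list element becomes one column, so rows correspond to windows and columns to files.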
284 |
285 | # Resources
286 |
287 | - Lawrence, M, and Morgan, M. 2014. Scalable Genomics with R and
288 | Bioconductor. Statistical Science 2014, Vol. 29, No. 2,
289 | 214-226. http://arxiv.org/abs/1409.2864v1
290 |
291 | [BiocParallel]: http://bioconductor.org/packages/release/bioc/html/BiocParallel.html
292 | [GenomicFiles]: http://bioconductor.org/packages/release/bioc/html/GenomicFiles.html
293 |
--------------------------------------------------------------------------------
/vignettes/our_figures/ChIPSeq-workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/ChIPSeq-workflow.png
--------------------------------------------------------------------------------
/vignettes/our_figures/ChIPSeq_nbt-1508-F1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/ChIPSeq_nbt-1508-F1.jpg
--------------------------------------------------------------------------------
/vignettes/our_figures/FilesToPackages.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/FilesToPackages.png
--------------------------------------------------------------------------------
/vignettes/our_figures/GRanges.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/GRanges.png
--------------------------------------------------------------------------------
/vignettes/our_figures/GRangesImplementation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/GRangesImplementation.png
--------------------------------------------------------------------------------
/vignettes/our_figures/GRangesList.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/GRangesList.png
--------------------------------------------------------------------------------
/vignettes/our_figures/GRangesListImplementation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/GRangesListImplementation.png
--------------------------------------------------------------------------------
/vignettes/our_figures/RangeOperations.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/RangeOperations.png
--------------------------------------------------------------------------------
/vignettes/our_figures/SequencingEcosystem.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/SequencingEcosystem.png
--------------------------------------------------------------------------------
/vignettes/our_figures/SummarizedExperiment.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/SummarizedExperiment.png
--------------------------------------------------------------------------------
/vignettes/our_figures/batch_effects_nrg2825-f2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/batch_effects_nrg2825-f2.jpg
--------------------------------------------------------------------------------
/vignettes/our_figures/copy_number_QC_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/copy_number_QC_2.png
--------------------------------------------------------------------------------
/vignettes/our_figures/journal.pcbi.1003118.t001.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/journal.pcbi.1003118.t001.png
--------------------------------------------------------------------------------
/vignettes/our_figures/nmeth.3252-F2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/nmeth.3252-F2.jpg
--------------------------------------------------------------------------------
/vignettes/our_figures/nrg2825-f2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/nrg2825-f2.jpg
--------------------------------------------------------------------------------