├── .gitignore
├── DESCRIPTION
├── NAMESPACE
├── README.md
├── inst
    ├── doc
    │   └── GRanges_and_GRangesList_slides.pdf
    ├── extdata
    │   ├── E-MTAB-1147-toptable.csv
    │   ├── csaw-data-filtered.Rds
    │   └── csaw-normfacs.Rds
    └── script
    │   └── ChIPSeq
    │       └── NFYA
    │           ├── csaw.R
    │           ├── preprocess_NFYA.R
    │           └── setup.R
└── vignettes
    ├── A_Introduction.Rmd
    ├── B_GenomicRanges.Rmd
    ├── C_DifferentialExpression.Rmd
    ├── D_MachineLearning.Rmd
    ├── E_GeneSetEnrichment.Rmd
    ├── F_ChIPSeq.Rmd
    ├── GSEA
        ├── 2012-07-06-Gentleman-GSEA.pdf
        ├── Category-ribosome.png
        ├── GSEA2011.pdf
        ├── GSEAlm.pdf
        ├── L8_Gene_Set_Testing.pdf
        ├── SPIA.pdf
        ├── SPIA.png
        ├── goseq.pdf
        ├── subramanian-F1-part.jpg
        ├── subramanian-F1.jpg
        ├── young-et-al-gb-2010-11-2-r14-2-cropped.pdf
        ├── young-et-al-gb-2010-11-2-r14-2-cropped.png
        ├── young-et-al-gb-2010-11-2-r14-2-cropped.tiff
        └── young-et-al-gb-2010-11-2-r14-2.pdf
    ├── I_LargeData.Rmd
    └── our_figures
        ├── ChIPSeq-workflow.png
        ├── ChIPSeq_nbt-1508-F1.jpg
        ├── FilesToPackages.png
        ├── GRanges.png
        ├── GRangesImplementation.png
        ├── GRangesList.png
        ├── GRangesListImplementation.png
        ├── RangeOperations.png
        ├── SequencingEcosystem.png
        ├── SummarizedExperiment.png
        ├── batch_effects_nrg2825-f2.jpg
        ├── copy_number_QC_2.png
        ├── journal.pcbi.1003118.t001.png
        ├── nmeth.3252-F2.jpg
        └── nrg2825-f2.jpg

/.gitignore:
--------------------------------------------------------------------------------
1 | vignettes/*R
2 | vignettes/*html
3 | vignettes/*\.md
4 | vignettes/AMakefile
--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
1 | Package: UseBioconductor
2 | Type: Package
3 | Title: Use R / Bioconductor for Sequence Analysis
4 | Version: 0.1.0
5 | Authors@R: c(person("Martin", "Morgan", role=c("aut", "cre"),
6 |     email="mtmorgan@fhcrc.org"),
7 |     person("Herve", "Pages", role="aut"))
8 | License: Artistic-2.0
9 | VignetteBuilder: knitr
10 | Description: This course is
directed at intermediate users wanting to
11 |     make effective use of R / Bioconductor for the analysis and
12 |     comprehension of high-throughput sequence data.
13 |     The README.md file in the root directory of the
14 |     package GitHub repository outlines detailed content. The course
15 |     combines lectures with extensive hands-on practicals; participants
16 |     are required to bring a laptop with wireless internet access and a
17 |     modern version of the Chrome or Safari web browser.
18 | Imports: devtools
19 | Suggests: ALL, AnnotationHub, BSgenome.Hsapiens.UCSC.hg19,
20 |     BiocInstaller, BiocParallel, BiocStyle, Biostrings, ChIPseeker,
21 |     DESeq2, GO.db, GenomicAlignments, GenomicRanges, GenomicFeatures,
22 |     GenomicFiles, Gviz, MLSeq, PoiClaClu, RColorBrewer,
23 |     RNAseqData.HNRNPC.bam.chr14, Rsamtools, ShortRead,
24 |     TxDb.Hsapiens.UCSC.hg19.knownGene, VariantAnnotation, airway,
25 |     class, cn.mops, csaw, dendextend, e1071, edgeR, fission,
26 |     genefilter, ggplot2, goseq, gplots, httr, kernlab, knitr, limma,
27 |     microbenchmark, org.Hs.eg.db, rtracklayer, shiny, sva, xtable
28 |
--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | exportPattern("^[^\\.]")
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Use R / Bioconductor for Sequence Analysis
2 | ==========================================
3 |
4 | Fred Hutchinson Cancer Research Center, Seattle, WA
5 | 6-7 April, 2015
6 |
7 | Contact: Martin Morgan
8 | ([mtmorgan@fredhutch.org](mailto:mtmorgan@fredhutch.org))
9 |
10 | This **INTERMEDIATE** course is designed for individuals comfortable
11 | using _R_, and with some familiarity with _Bioconductor_. It consists
12 | of approximately equal parts lecture and practical sessions addressing
13 | use of _Bioconductor_ software for analysis and comprehension of
14 | high-throughput sequence and related data. Specific topics include use
15 | of central _Bioconductor_ classes (e.g., _GRanges_,
16 | _SummarizedExperiment_), RNA-Seq gene differential expression, ChIP-seq
17 | and methylation work flows, approaches to management and integrative
18 | analysis of diverse high-throughput data types, and strategies for
19 | working with large data. Participants are required to bring a laptop
20 | with wireless internet access and a modern version of the Chrome or
21 | Safari web browser.
22 |
23 | Registration
24 | ------------
25 |
26 | Please register [online](https://register.bioconductor.org/Seattle-Apr-2015/).
27 |
28 | Schedule (tentative)
29 | --------------------
30 |
31 | Day 1 (9:00 - 12:30; 1:30 - 5:00)
32 |
33 | - [A. Introduction](vignettes/A_Introduction.Rmd). _Bioconductor_ and
34 |   sequencing work flows
35 | - [B. Genomic Ranges](vignettes/B_GenomicRanges.Rmd). Working with Genomic
36 |   Ranges and other _Bioconductor_ data structures (e.g., in the
37 |   [GenomicRanges](http://bioconductor.org/packages/devel/bioc/html/GenomicRanges.html)
38 |   package).
39 | - [C. Differential Gene Expression](vignettes/C_DifferentialExpression.Rmd). RNA-Seq
40 |   known gene differential expression with
41 |   [DESeq2](http://bioconductor.org/packages/devel/bioc/html/DESeq2.html)
42 |   and
43 |   [edgeR](http://bioconductor.org/packages/devel/bioc/html/edgeR.html).
44 |
45 | Day 2 (9:00 - 12:30; 1:30 - 5:00)
46 |
47 | - [D. Machine Learning](vignettes/D_MachineLearning.Rmd).
48 | - [E.
Gene Set Enrichment](vignettes/E_GeneSetEnrichment.Rmd).
49 | - [F. ChIP-seq](vignettes/F_ChIPSeq.Rmd). ChIP-seq with
50 |   [csaw](http://bioconductor.org/packages/devel/bioc/html/csaw.html).
51 | - [I. Large Data](vignettes/I_LargeData.Rmd) -- efficient, parallel, and cloud
52 |   programming with
53 |   [BiocParallel](http://bioconductor.org/packages/devel/bioc/html/BiocParallel.html),
54 |   [GenomicFiles](http://bioconductor.org/packages/devel/bioc/html/GenomicFiles.html),
55 |   and other resources.
56 |
57 |
--------------------------------------------------------------------------------
/inst/doc/GRanges_and_GRangesList_slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/inst/doc/GRanges_and_GRangesList_slides.pdf
--------------------------------------------------------------------------------
/inst/extdata/E-MTAB-1147-toptable.csv:
--------------------------------------------------------------------------------
1 | "","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj"
2 | "3183",1432.40015979451,-4.52349712874567,0.142649929348813,-31.7104757737709,1.1142183172124e-220,4.99169806111155e-218
3 | "91828",58.2842234383409,-3.47820360534797,0.375671633985405,-9.25862719111525,2.07077412382055e-20,8.43369824974187e-19
4 | "81537",1106.56971762968,-2.75754768126593,0.130103490989373,-21.1950322031802,1.06126034080731e-99,2.37722316340838e-97
5 | "4776",97.461818194366,-2.24586186354094,0.255656305034135,-8.78469186684474,1.56793739271017e-18,4.94302089387694e-17
6 | "283624",29.5352007313394,-2.24001491381734,0.405813908150946,-5.51980814069129,3.39369993363115e-08,4.34393591504787e-07
7 | "4053",135.522149417936,-2.11428205785264,0.24084465907166,-8.77861301139987,1.6550293171463e-18,4.94302089387694e-17
8 |
"85446",82.5670990089107,-2.00352501351428,0.263740884783949,-7.59656590655457,3.04092459703197e-14,7.17018010247538e-13 9 | "10484",1400.24185534692,-1.85837786734961,0.121197313100289,-15.3334906510001,4.56766676123494e-53,6.82104903011085e-51 10 | "55701",344.20811516999,-1.73961736041058,0.163991210221822,-10.6079914774548,2.73566503516169e-26,1.75082562250348e-24 11 | "1112",413.054950570526,1.71225867245646,0.144915703196688,11.8155495552644,3.2442057374712e-32,2.9068083407742e-30 12 | "26020",5310.95322880689,-1.70905823813073,0.143517145685143,-11.9083906662983,1.07022910135238e-32,1.19865659351466e-30 13 | "6038",32.4914429792246,-1.69968784664343,0.366391606851103,-4.63899231003443,3.50112079027611e-06,2.58385770343617e-05 14 | "57156",5.40571279381349,-1.67288939044111,0.661374876801022,-2.52941175892959,0.0114253899724891,0.0271297790944442 15 | "731223",1.9671026203033,-1.6631520358834,0.729790796053574,-2.27894356146595,0.0226704204780558,NA 16 | "5228",24.9520902859294,-1.66212943841284,0.418983607463788,-3.9670512373363,7.27673532212441e-05,0.00037906714236183 17 | "100289511",87.8236602256995,-1.56877555472898,0.279314024795923,-5.61652983904117,1.94830617205936e-08,2.64497322752301e-07 18 | "117153",2.51057825696152,-1.56229986614111,0.728707809352641,-2.14393182849105,0.0320383494699092,0.0639014161814284 19 | "64207",837.459611647065,-1.55593023990358,0.140965473336716,-11.0376690339416,2.51472224057808e-28,1.87765927296496e-26 20 | "51062",20.0323159397424,-1.55388086562808,0.435311134374935,-3.56958676891944,0.000357544779217963,0.00146954184485915 21 | "145553",4.81655530302247,-1.54751717547855,0.671233003068312,-2.30548433763627,0.0211394629389538,0.045751108196383 22 | "10243",338.746609584515,-1.54194126450271,0.156307175438636,-9.86481433226364,5.9143766881063e-23,3.31205094533953e-21 23 | "57523",53.2587591296687,-1.50299458765496,0.299482795871286,-5.01863415319833,5.2040146572385e-07,4.69605248272374e-06 24 | 
"83982",10.8825096069583,-1.48573332331613,0.540152626758592,-2.75058057614545,0.0059489758844623,0.0164514888656735 25 | "623",4.44636529012901,-1.45389193760792,0.673415172028592,-2.1589830434444,0.0308514824660318,0.0619796598420729 26 | "23428",465.75556990637,-1.44283570918619,0.152805630779627,-9.44229412113101,3.64705388049678e-21,1.63388013846256e-19 27 | "80757",3.97991438538255,-1.40720034199387,0.691136219604602,-2.03606800233813,0.0417435272586461,0.0799192316746729 28 | "55320",238.966477802658,1.38896184417887,0.202430797920733,6.86141564646082,6.81814611029793e-12,1.32805628583194e-10 29 | "79686",22.1405377031261,1.38080312985046,0.387682207339192,3.56168816548851,0.00036847788118727,0.00150070991610816 30 | "2353",180.28473241364,-1.37714962702149,0.219469882000525,-6.2748911808236,3.49878909654846e-10,5.80539820464336e-09 31 | "100288846",6.13470711409441,-1.3658476601296,0.624759870613376,-2.18619620813456,0.0288012535736109,0.0586498254589895 32 | "400221",0.974675560282545,-1.34592932682023,0.694561692801581,-1.9378110551869,0.0526462785117864,NA 33 | "55812",15.7757619629031,-1.32903981572061,0.467486530113565,-2.84294782867381,0.00446983865153918,0.0128364597172407 34 | "3306",946.728724035196,-1.30836085245296,0.153920421934333,-8.50024211219445,1.8919567401043e-17,5.29747887229204e-16 35 | "5583",10.2596742151672,-1.3059745576008,0.541578416948674,-2.41142282766516,0.0158904143243273,0.0360168753156853 36 | "122769",177.230285283856,1.30022612451547,0.171848110281372,7.56613571360531,3.84490364494892e-14,8.61258416468559e-13 37 | "9369",327.512279792085,-1.2986791099135,0.153417364982469,-8.46500726995215,2.56136714674227e-17,6.74995577494434e-16 38 | "25938",2940.80984818683,-1.24943045176459,0.128083761393273,-9.75479200621142,1.75961002141004e-22,8.75894766212997e-21 39 | "7080",39.9797438561718,-1.23377091972103,0.321577710150241,-3.83661827539172,0.000124740133499521,0.000576119379461705 40 | 
"9472",71.9883051049858,-1.22309036440844,0.251412913655463,-4.86486691007675,1.14533850830321e-06,9.32930275854249e-06 41 | "90668",14.4440661626902,-1.22071567267618,0.483546072224984,-2.52450747259552,0.0115860541702236,0.0273186961487376 42 | "145447",2.09685833455682,1.19563219242626,0.73157338941382,1.63432980166797,0.102189619324906,NA 43 | "5687",62.354573094425,-1.1872684016993,0.287257676434532,-4.13311287773328,3.57882850787495e-05,0.000208222749549088 44 | "283547",2.40093478925935,-1.18083334540971,0.726772496244827,-1.6247633908974,0.104212984371431,NA 45 | "7253",0.879171747422991,-1.17519089359932,0.675137230259378,-1.74066966081522,0.0817414992702238,NA 46 | "122416",415.961946881745,-1.17435665390991,0.355753515247696,-3.30104019658739,0.000963270808372096,0.0034249628742119 47 | "84334",54.2512832894478,1.1433387891006,0.260016835963868,4.39717214795845,1.09670361316364e-05,7.22534145143106e-05 48 | "97",90.1519744789873,1.12694944145441,0.217735052457569,5.17578326840149,2.26956797454968e-07,2.25603810221508e-06 49 | "196872",15.7593489679742,-1.12367315956595,0.444350633099326,-2.52879837647215,0.0114453755554687,0.0271297790944442 50 | "122616",113.349294456021,-1.11472480377903,0.22260554875543,-5.00762361949812,5.5106158935075e-07,4.84069788292424e-06 51 | "145482",46.4020593503469,1.1088341336381,0.278308701288859,3.98418780477595,6.77113089376902e-05,0.000361126981001014 52 | "89932",681.292332079321,-1.09389329764706,0.142895654664534,-7.65518937727752,1.93027575976792e-14,4.80424189097793e-13 53 | "79446",200.297567743048,-1.08659232599486,0.171699779013473,-6.32844335757474,2.47646729534503e-10,4.43782939325829e-09 54 | "145483",26.2372438697388,-1.0819372654162,0.394151208445639,-2.74498020615716,0.00605145475255693,0.0166322191972117 55 | "113146",8091.8204447092,-1.06040850099806,0.117993155913062,-8.9870339749144,2.53992126641034e-19,8.75295944116794e-18 56 | 
"3320",9589.96294823561,1.05902675815388,0.117077327549075,9.04553238721627,1.489376137666e-19,5.56033758061975e-18 57 | "2287",149.792366848785,-1.05205839830652,0.232812801346378,-4.51890270733556,6.21609492662488e-06,4.42033417004436e-05 58 | "440193",553.897351675936,1.04518810852461,0.149294487265307,7.00084864263766,2.54416682354357e-12,5.18084880430691e-11 59 | "256369",0.701899246796241,-1.02704072659222,0.654302908901173,-1.56967164996563,0.116491519980812,NA 60 | "1033",151.151456683266,1.02235205325258,0.194235517821479,5.26346604740035,1.41364624015627e-07,1.50788932283336e-06 61 | "7011",393.149102247071,-1.01811559110806,0.151810213548844,-6.70650259496861,1.99344304318901e-11,3.72109368061948e-10 62 | "376267",20.9890059390901,1.00660938767154,0.391700951193329,2.56984157073114,0.0101745033734847,0.0248959000389302 63 | "55333",236.474895409753,-0.990285186257954,0.184922192044451,-5.35514518462939,8.54877463379526e-08,9.34110008765921e-07 64 | "9495",11.2550212796222,0.968066197098626,0.512272398980966,1.88974888950555,0.0587915521164752,0.104105199004668 65 | "122786",78.9553417891622,-0.954789277524657,0.24946765726316,-3.82730686614603,0.000129552963347783,0.000592242118161295 66 | "387990",6.16710103651015,0.952202002872035,0.606639241618118,1.56963469809863,0.116500121331762,0.189101646219671 67 | "5083",108.475141844886,0.91754560883159,0.220373762126575,4.16358826013319,3.13284844360007e-05,0.000187135480364377 68 | "57452",0.593146932225047,-0.910528037565311,0.633755269119884,-1.43671868611016,0.150797943220308,NA 69 | "122525",3.52735441658135,-0.887435057422969,0.697239620683527,-1.2727834608036,0.203094891709724,0.303288371619855 70 | "55102",412.607879129318,-0.880915780433987,0.140260751881845,-6.28055795092319,3.37360099543862e-10,5.80539820464336e-09 71 | "51016",278.963431261046,-0.876459445298556,0.171960749160328,-5.09685756533537,3.45337959606685e-07,3.29173204050628e-06 72 | 
"283596",77.4942204410022,0.875541553881386,0.232175763794369,3.77102906682726,0.000162575732903335,0.000714058120987196 73 | "60485",235.494633614731,0.864454530417008,0.167142424890047,5.17196355734147,2.3164676942387e-07,2.25603810221508e-06 74 | "9787",449.732534427427,0.852685479266412,0.168171590857491,5.07033010105124,3.97126295542903e-07,3.70651209173376e-06 75 | "9870",1192.41922715277,-0.849621050535841,0.118733863130534,-7.15567596416687,8.3261609148454e-13,1.77624766183369e-11 76 | "55727",396.983610074766,0.847078271118886,0.148688222217879,5.69700988069942,1.21926864055903e-08,1.70697609678264e-07 77 | "643866",29.8986180482568,-0.841230916054162,0.339840518976871,-2.47536967806778,0.0133098327193585,0.0305784874783211 78 | "10965",407.735439529025,-0.838800199151072,0.156233440131432,-5.36889028651885,7.9222592850066e-08,8.8729303992074e-07 79 | "22863",204.054348536117,0.836528646902067,0.180364617893545,4.63798641148013,3.51819910512514e-06,2.58385770343617e-05 80 | "55745",166.358100525616,-0.827760294086051,0.188966118115341,-4.3804693790704,1.18423934203653e-05,7.68897427872991e-05 81 | "79609",45.6179984660877,0.82409110067801,0.287266822033321,2.8687305232291,0.00412122756848354,0.0120673852985662 82 | "1397",1966.60401305455,-0.822542701404016,0.168169908806144,-4.89114079470778,1.00253217134125e-06,8.47423420303551e-06 83 | "4253",154.936826325271,-0.815084188732455,0.212228713883587,-3.84059335712485,0.000122737283887363,0.000572773991474363 84 | "161394",5.22258593152774,-0.806396187420136,0.637288592668533,-1.2653548120852,0.20574416610269,0.305209888788096 85 | "2342",24.5550647317514,0.806091130810016,0.390961291856742,2.06181826078421,0.0392250418626059,0.0757449084243425 86 | "55644",385.073371542473,0.800991096574132,0.147672584132224,5.42410157769659,5.82467535549846e-08,6.86698568227187e-07 87 | "283554",59.938691282558,-0.796349135190337,0.267164280867933,-2.98074702427753,0.00287546221876165,0.00876331342860692 88 | 
"11099",468.355626236144,0.79598401810395,0.132593638853168,6.00318405157739,1.9348507539127e-09,2.96643849785836e-08 89 | "55775",317.34226625771,0.795795937459591,0.144087883201904,5.52298999593516,3.33278718861787e-08,4.34393591504787e-07 90 | "10516",313.360473673507,-0.788295129976791,0.157116393502735,-5.01726848741007,5.24113000303989e-07,4.69605248272374e-06 91 | "64423",3154.45957727645,-0.783851075312696,0.165179220758605,-4.74545812550008,2.08034949164283e-06,1.63508170571226e-05 92 | "123036",12.0736511733874,0.781246766183047,0.498774220877891,1.56633349014706,0.117270564564647,0.189665028609971 93 | "161436",32.6827330441306,0.780582192381665,0.334955368955764,2.33040656973244,0.0197846728755009,0.0432367485279239 94 | "122706",0.477524391504373,-0.777838058677479,0.605830758098615,-1.28391972226485,0.199170045725382,NA 95 | "1053",0.500433314666908,-0.777667505227457,0.605897412145765,-1.28349699080802,0.199318013158563,NA 96 | "8812",589.050177970424,0.767404264778082,0.128940889603387,5.95159741132982,2.65537899966118e-09,3.83745094144584e-08 97 | "57820",461.320081323437,0.761338992960825,0.138424726432583,5.50002165495788,3.79744608355319e-08,4.72571068175508e-07 98 | "57570",181.225063317447,0.759943670172124,0.193798351453344,3.92131132423527,8.80683916950211e-05,0.000448348175901926 99 | "51562",80.3712155569268,0.744997291414789,0.265466738223546,2.80636774460022,0.00501034754741829,0.0142065550711607 100 | "645431",1.10140075018282,-0.738651270256046,0.721669652787289,-1.02353101228958,0.30605684462904,NA 101 | "1241",184.282018219079,0.737599497791902,0.175269626758216,4.20837033452133,2.5721906119404e-05,0.00016230160480976 102 | "1965",1002.8175409666,0.725176694297716,0.132863688005995,5.45805031593732,4.81391256493193e-08,5.82873737591758e-07 103 | "8814",65.9585869441561,-0.724716803593119,0.272062047371212,-2.66379236132222,0.00772652383936536,0.0201248521967621 104 | 
"10572",1533.9592112381,-0.72335246379959,0.172798275562376,-4.1861092736338,2.8377677144071e-05,0.00017415341589786 105 | "161357",3.91093614010412,0.720132613320407,0.689070596739257,1.04507813383438,0.295986859209724,0.407549440212015 106 | "23351",1551.62887257246,-0.708929251855538,0.118176328279823,-5.99891079858991,1.98645435124443e-09,2.96643849785836e-08 107 | "123099",6.91419698610375,0.706087020640966,0.62433958513667,1.13093425028689,0.258082766187041,0.362448524300296 108 | "51804",378.731637143684,0.701938692448754,0.14386571196158,4.87912430889863,1.0655789055921e-06,8.84035832787521e-06 109 | "55038",540.199366046281,0.697281116634237,0.166471422414359,4.18859349263357,2.80688672128689e-05,0.00017415341589786 110 | "7443",115.006428831965,0.69542916816118,0.212805569550322,3.26790868129385,0.00108345317494672,0.0037537156353444 111 | "60686",216.029900967751,0.695151900366022,0.168837524944053,4.11728317266184,3.83364921970008e-05,0.000216986144434838 112 | "100529261",15.2191803823681,0.68839581831613,0.435485375543253,1.58075530655278,0.113933997210282,0.18669728198366 113 | "90673",48.104431834993,-0.688274871515456,0.280867245112756,-2.4505344909092,0.0142644295788307,0.0326044104658987 114 | "81892",395.803074661241,0.687404960674608,0.139450638959126,4.92937835068718,8.24916818652695e-07,7.10697566839244e-06 115 | "55237",0.244604444377625,0.683497037359114,0.493198534099321,1.38584563842493,0.165794043181846,NA 116 | "728635",12.680881774761,-0.681218636551113,0.497535888519006,-1.36918492167242,0.170941478081073,0.260812538585099 117 | "9056",8.49718166095661,-0.677819304598832,0.560715398208492,-1.20884731677512,0.226721509350867,0.331932144409113 118 | "10598",2014.19504407581,0.671376830801339,0.10847405537726,6.18928487983942,6.04377703812324e-10,9.67004326099718e-09 119 | "400236",21.1880920520839,-0.668792765908579,0.416035525781448,-1.60753763672554,0.10793648008197,0.178537030374329 120 | 
"145376",1.52610582752836,0.663388347949224,0.737712673122482,0.899250307225076,0.368519349660451,NA 121 | "79696",9.65485101596621,-0.66106540118584,0.521843539707438,-1.26678851204415,0.205230920885145,0.305209888788096 122 | "10668",56.0637010937899,-0.660720344312448,0.288840443937058,-2.28749248306939,0.0221670936916351,0.0472897998754883 123 | "7051",140.268301368179,0.656900793140942,0.209462054466489,3.13613267478972,0.00171191738870911,0.00567054586670355 124 | "55030",832.784271522788,0.655091882056005,0.125664378037104,5.21302768762823,1.85783154362855e-07,1.93560123615254e-06 125 | "8747",50.8933247002735,-0.649388157655921,0.294494022406118,-2.20509792473951,0.0274472236855753,0.0561477452563367 126 | "280655",14.0014993486591,0.642737006948317,0.449238967375927,1.43072407699324,0.152509309957384,0.23641581612771 127 | "317762",1333.95111831557,0.638991793463764,0.13545977572899,4.71720693486291,2.39104461061397e-06,1.8468758371639e-05 128 | "91875",215.506706862087,0.627640208822662,0.161642032564476,3.88290223071965,0.000103217077512728,0.000491928199209595 129 | "26257",8.93400191991033,-0.620317291620155,0.535309279744913,-1.15880167800519,0.246537033382495,0.351566814757137 130 | "84684",0.777429071841636,0.619721599610084,0.659250063011809,0.940040258439814,0.347196910799642,NA 131 | "55671",625.836442042281,0.619050102819896,0.138579058003344,4.46712592609022,7.92774683923988e-06,5.45491210963185e-05 132 | "91750",72.0802359471852,0.618551207534182,0.234714480713742,2.63533466556146,0.00840543635060401,0.0215179170575463 133 | "8846",211.90687817786,0.618441499314596,0.158628478653727,3.89867887887019,9.67189383184652e-05,0.000476154773260136 134 | "80224",136.679724606725,0.617932538045701,0.203913144328084,3.0303712891185,0.00244253268868996,0.00761704448367126 135 | "80821",349.527185042906,0.617517411492699,0.149771060283887,4.12307564840771,3.73846853307419e-05,0.000214722295232979 136 | 
"57143",234.321216262903,-0.616132452895485,0.15772102838788,-3.90646991839442,9.36542963723545e-05,0.000466971873614538 137 | "83694",200.19809480727,0.615634577199985,0.18447082191817,3.33730055950571,0.000845964115987908,0.00308123515416734 138 | "161176",107.131534234867,0.613791305293965,0.207773831236046,2.95413191181259,0.00313549949412992,0.00942754210315575 139 | "122953",370.418506971415,0.609465408218364,0.148786535102672,4.09624034727197,4.19913975953452e-05,0.000232248717564378 140 | "25983",339.856444832978,0.607076434794237,0.147534004366222,4.11482381571707,3.87475257919354e-05,0.000216986144434838 141 | "100628307",14.6428787000668,-0.603863968450644,0.478656505942236,-1.26158103139523,0.207099585251056,0.306206647499911 142 | "56936",3.28985037391207,-0.599751783119829,0.70903649767476,-0.84586870363751,0.397625993173909,0.506229841174152 143 | "80017",95.1364434285455,-0.597380740352245,0.219089659543622,-2.72664963557216,0.0063980935334488,0.0173717933514246 144 | "6547",0.971363310048689,-0.594641566962505,0.712384317010624,-0.834720182299629,0.403875275463728,NA 145 | "100129794",21.0913652411075,-0.583254721907116,0.400852458228242,-1.45503591143007,0.145659319921442,0.228165647988833 146 | "54813",118.255348287161,-0.581245341463074,0.207417604146694,-2.80229512752445,0.00507404324876383,0.0142966753172717 147 | "79038",711.794187477751,0.579991868971388,0.145982261174355,3.97302976612117,7.09641770811084e-05,0.000374022956851018 148 | "9252",31.4082058725744,-0.578850832813967,0.347150886586917,-1.66743296698751,0.0954283431499982,0.161327915966789 149 | "652",155.215285316105,-0.575029771356739,0.186175507044211,-3.0886435089455,0.00201072555057042,0.00638868827415282 150 | "22990",4575.41546492155,-0.57376866347385,0.106085275516115,-5.4085608081085,6.35332166094378e-08,7.29817462590465e-07 151 | "11169",517.390335277691,0.57216352716212,0.14784733663199,3.86996167936595,0.000108852461787257,0.000513325293480957 152 | 
"26153",1508.85595650182,-0.571201924099784,0.192860958129978,-2.96172916301092,0.00305916747270283,0.00926018262007343 153 | "22938",389.279138150234,0.568098591424289,0.145440130018899,3.90606493098204,9.38113138957778e-05,0.000466971873614538 154 | "51109",2692.50482148618,-0.563820241593268,0.108550083857944,-5.19410231254288,2.05709995012853e-07,2.09450176740359e-06 155 | "9529",647.300394518713,0.562853000726294,0.125326221062931,4.49110326596112,7.0855184563955e-06,4.95986291947685e-05 156 | "5527",360.78940204713,0.562144445921013,0.147928213395034,3.8001165093493,0.000144628073814914,0.000654478556253347 157 | "317760",1.11984843894904,0.559663437011084,0.72986255237593,0.76680662021802,0.443196499682034,NA 158 | "57475",314.851584098013,0.55574832781228,0.148274744914798,3.74809835708451,0.000178180384660495,0.000767546272383669 159 | "2972",525.952338579188,-0.55393523477008,0.153952289564633,-3.59809676320223,0.000320554365092163,0.0013421341641242 160 | "8111",0.720793642787434,-0.552731884977695,0.685469876801436,-0.806354741009002,0.420038335262208,NA 161 | "56413",65.058851443831,0.552356606227336,0.260350311699041,2.12158995555899,0.0338721866052502,0.066265238424245 162 | "6710",586.584335338001,0.549730360306443,0.132562449619739,4.14695384615605,3.36927947886522e-05,0.000198610158754161 163 | "64093",11602.1575379348,-0.547994050826652,0.11969194021018,-4.57837052239582,4.68612085582908e-06,3.38610023130875e-05 164 | "10490",195.34971214517,0.544853968146428,0.166426453885026,3.27384232150277,0.00106095795658738,0.00374259184685942 165 | "91748",1332.09809198065,0.543465094520931,0.123149861689813,4.41303861054914,1.01929809532618e-05,6.81560517471833e-05 166 | "55668",275.242022255629,0.541510317767095,0.148145232211379,3.65526658998008,0.00025691486378664,0.00108966489467019 167 | "80127",79.0738051815004,0.532271715723607,0.222192063147185,2.39554783453727,0.0165955486570083,0.0371740289916985 168 | 
"55072",702.163628132844,-0.531879193512233,0.140659772889485,-3.78131702181919,0.000156000863185405,0.000694861454883061 169 | "283578",208.745398578219,-0.5259283183146,0.168704678005758,-3.11744952500161,0.00182423181526487,0.00592214386404828 170 | "122509",209.643535351167,-0.524701697110866,0.18475338378455,-2.84001129702039,0.00451119362081539,0.0128727053638554 171 | "283600",0.550461492546276,0.523528566766729,0.599123978763048,0.873823424406426,0.382214421749911,NA 172 | "1743",1514.19888879154,0.521718114815406,0.11122237257948,4.69076591980256,2.72184224395567e-06,2.06675478863075e-05 173 | "7453",3893.71824011719,-0.521412850943666,0.108624464573438,-4.80014196609597,1.58553194675406e-06,1.26842555740324e-05 174 | "6729",607.502675796322,0.520144647505694,0.149578226599278,3.4774088403867,0.00050628517741201,0.00195530827138431 175 | "338005",304.178617003405,0.518608606759575,0.147145242177688,3.52446738395604,0.000424335050947289,0.00168231949402111 176 | "51222",806.643348887699,-0.518000738096405,0.19059744598518,-2.71777376354078,0.00657227597902945,0.0176310158000311 177 | "1073",378.133576171255,0.515571222488357,0.170174918588575,3.02965458579023,0.00244833572689433,0.00761704448367126 178 | "145389",113.777189377681,-0.515018021904882,0.224433765175872,-2.29474393704219,0.0217478032295133,0.0467666298752613 179 | "161247",6.69726937340053,0.511963268868847,0.587668530812848,0.871176933977922,0.383657552220604,0.492488777635618 180 | "23503",1184.72909009369,0.509856846738586,0.114209722737689,4.46421578230778,8.03625444722549e-06,5.45491210963185e-05 181 | "5427",96.4388016119171,0.505927572055077,0.213973823444131,2.36443675170938,0.018057510408259,0.0398510574527096 182 | "26520",116.934542268689,0.503453042248215,0.208833542586254,2.41078629425766,0.0159181725725574,0.0360168753156853 183 | "384",306.563931996922,0.497698182536876,0.148546406906343,3.3504558804352,0.000806786605187183,0.0029626262223267 184 | 
"115708",501.513726289288,0.496829782751993,0.169238739486455,2.93567409128426,0.0033282379558574,0.00994033736149409 185 | "51637",396.266441921992,0.494231316267667,0.15257563473295,3.23925453190943,0.00119842573092238,0.00409843303399408 186 | "145241",6.47866414663616,-0.492916070688717,0.624615815918924,-0.789150799781699,0.430023874006447,0.536631463941193 187 | "9578",4085.72471089244,-0.489462686002605,0.125744062869859,-3.89253118462684,9.92037430051905e-05,0.000483079096373101 188 | "30001",732.457150357068,-0.486747984576763,0.142078458864643,-3.42590979988377,0.000612743830326441,0.00228757696655205 189 | "6655",180.616921806147,0.486546667773334,0.182482732945438,2.66626140413411,0.00767000286010762,0.0200945104171241 190 | "2079",948.316810885468,0.485976315344017,0.114154583698701,4.25717741327583,2.07024066338432e-05,0.000132495402456597 191 | "29986",0.380986229138297,0.485355169844338,0.560917728427151,0.865287626414136,0.38688094052525,NA 192 | "641372",0.380986229138297,0.485355169844338,0.560917728427151,0.865287626414136,0.38688094052525,NA 193 | "11183",303.832111715731,-0.484034010061898,0.165057870577578,-2.93251093309362,0.00336233083771618,0.00997565705494602 194 | "90135",343.910124938093,0.482290968033721,0.155117180368582,3.10920406680759,0.00187592103319072,0.00600294730621031 195 | "4149",394.373803558462,0.480971102165786,0.134123332850361,3.58603601583921,0.000335742609793247,0.00139271008506828 196 | "122830",180.09243729405,0.480848148255564,0.178428430117556,2.69490768897513,0.00704081340654944,0.0187755024174652 197 | "100506321",65.5654658367666,0.480323427633997,0.243033253785128,1.97636915999431,0.0481129818122861,0.089810899382934 198 | "729665",2.80467402902896,0.480065126919403,0.722675065513158,0.664289041961782,0.506505333151167,0.60998491734334 199 | "57862",411.058560782882,0.48001182869739,0.138979868696421,3.45382272410905,0.000552700563114923,0.00210849107825267 200 | 
"55778",117.014215438235,-0.475856371305524,0.214374664477652,-2.21974164934556,0.0264363098811795,0.0543278294805892 201 | "81693",3.68234202063747,0.47533923304985,0.692852212602537,0.686061506918408,0.492674323185692,0.598152023813523 202 | "26037",1056.20474177558,-0.474794036268472,0.117102418631082,-4.05451946952741,5.02374664275109e-05,0.000274468109262499 203 | "122622",192.428804844724,0.473014455550312,0.18099438824314,2.61342056039265,0.00896409201421164,0.0225613102380158 204 | "84837",22.9090586920735,-0.471220787974927,0.39703318770948,-1.18685490926701,0.235284855733375,0.341784998969093 205 | "10001",141.221877652282,0.470904225243499,0.18908488412972,2.49043823577373,0.0127585669344584,0.0296157408634059 206 | "79944",270.31235806466,0.469951575662422,0.149928257618952,3.13450968567126,0.00172141570953501,0.00567054586670355 207 | "51077",177.001949667295,0.468591948501353,0.176040460443726,2.661842324885,0.00777142729919607,0.0201248521967621 208 | "1396",1456.12603877695,-0.466885652297618,0.148183318873893,-3.15073016211053,0.00162862879180377,0.00544496790095588 209 | "283571",1.46861441200196,-0.462294699304538,0.735085613590269,-0.628899125159885,0.529415098841371,NA 210 | "9240",661.513926674837,-0.461015410823631,0.126154835468136,-3.65436179368703,0.000257822497399644,0.00108966489467019 211 | "55837",196.35601758628,0.458566988712574,0.168414853904732,2.72284170950844,0.00647230606453007,0.0174674284151173 212 | "394",736.627442642957,0.45791169940337,0.151878061887233,3.01499567293243,0.00256982921907283,0.00793988613892847 213 | "84932",117.733342402507,0.457445244465456,0.200658548604319,2.27971969122281,0.0226243186713669,0.0478098809659073 214 | "4860",2170.52904214006,0.457438469456008,0.109606496550986,4.17346128058406,3.00006665753648e-05,0.000181625657104911 215 | "9556",1303.36747214995,0.456100989923994,0.117373605919156,3.88589058291473,0.000101955423490575,0.000491140104556747 216 | 
"23256",207.911052690631,-0.455900886141015,0.205364918351304,-2.21995504295985,0.0264218195193125,0.0543278294805892 217 | "5706",58.173144324597,0.45437151094867,0.27546597101963,1.64946512001764,0.0990523832734922,0.166200253582489 218 | "4901",14.2477103398457,-0.453748095732424,0.461455892390623,-0.983296785705199,0.325461391638011,0.435426059052633 219 | "56948",417.949792978999,-0.453567892704231,0.159237559293803,-2.84837254926378,0.00439434503770809,0.0127010746896337 220 | "9692",542.69428326241,0.453464706492158,0.129965726785914,3.48910991925686,0.000484631819716128,0.00188795700202457 221 | "4140",794.24177471666,0.452621067745769,0.120691328701293,3.75023684481917,0.00017666762363128,0.000767546272383669 222 | "100128927",236.11803724886,-0.452487280292052,0.17063611834376,-2.65176730860978,0.00800717032935643,0.0206161626870786 223 | "2530",216.210938823447,-0.451305618754248,0.175162429365784,-2.57649782769229,0.00998068326332107,0.024567835725098 224 | "55449",142.102111688682,-0.450770600045154,0.188734369318723,-2.38838639550553,0.0169225393206675,0.0377178985853684 225 | "84962",577.320321451022,0.447774003821377,0.137084921208361,3.2663986664207,0.00108924784061333,0.0037537156353444 226 | "53981",1048.85142573759,0.447650389958641,0.143948679022462,3.10979157987818,0.00187219397621029,0.00600294730621031 227 | "4329",246.960164683757,0.447643191637371,0.167157558369469,2.67797158563387,0.00740695015679091,0.019519492177896 228 | "317749",82.7580300559427,-0.443834646715034,0.232039629374233,-1.9127536443321,0.0557795981137979,0.100358473714785 229 | "112399",294.78949478351,-0.443731855564627,0.155768787349169,-2.84865705842573,0.00439041772330522,0.0127010746896337 230 | "51635",802.41807128455,-0.442329115856746,0.128117489223321,-3.45252719623412,0.000555361489361194,0.00210849107825267 231 | "81542",601.231714145266,-0.441494806109607,0.127962298902718,-3.45019439237528,0.000560183021030795,0.00210892431446887 232 | 
"57062",2348.41650548263,0.440373505453721,0.110214277000663,3.99561216058308,6.45272980965176e-05,0.000348291922255902 233 | "8487",57.715584578397,0.439871654347625,0.272226656660104,1.61582873530581,0.106131346932452,0.176754064779698 234 | "79789",21.9986173193528,0.439062473021326,0.386454643552366,1.1361293759738,0.255902422323076,0.360516620128108 235 | "11161",715.105508913183,0.438492219432314,0.12412329240078,3.53271502029185,0.00041131559174112,0.00164526236696448 236 | "4287",70.2822404296889,0.438267727378813,0.240794327519911,1.82009157729255,0.0687450600642271,0.119835746726746 237 | "3958",846.778110957202,-0.437483576782368,0.123386351975692,-3.54563993324445,0.00039166097558298,0.00158075781136194 238 | "1690",0.242997164174289,-0.437250728136416,0.495337480560693,-0.882732975589639,0.377380561070803,NA 239 | "9766",896.296524424026,0.434938646371191,0.131684570370554,3.3028823737458,0.000956965201150858,0.0034249628742119 240 | "145216",0.253504571050542,0.432632259412664,0.496857294270321,0.870737462047373,0.383897518706941,NA 241 | "1113",0.371979067831239,0.431595071618149,0.579135247661963,0.745240551944556,0.456126312029222,NA 242 | "4707",178.791345472213,0.428691703319275,0.172375374568477,2.48696604368495,0.0128837695670494,0.0297522101342171 243 | "10038",444.471305216434,0.423792413470318,0.133689117766327,3.16998436784553,0.0015244713883889,0.00513506151878366 244 | "7127",2015.20487712645,-0.420473749742107,0.137867534887832,-3.04983874618634,0.00228964264157115,0.00722366129171743 245 | "9321",350.524805417528,-0.419944221263059,0.162687972847247,-2.5812862125792,0.0098432936930397,0.0243635114612253 246 | "51218",227.670681076639,0.419476863683742,0.162227518206043,2.58573186794962,0.0097172490168919,0.0241851531087087 247 | "145200",0.247775729041724,-0.419412921727983,0.486431191550976,-0.862224563335862,0.388563955265745,NA 248 | 
"84502",0.247775729041724,-0.419412921727983,0.486431191550976,-0.862224563335862,0.388563955265745,NA 249 | "51199",708.409056056689,0.418832165333009,0.129415165536475,3.23634532009283,0.00121070822841957,0.00410907035099974 250 | "328",2347.23082623196,0.418059812456214,0.105694020476213,3.95537808641027,7.64137078935848e-05,0.000393486679727885 251 | "441687",0.267502439490617,0.4178399740206,0.492348268582714,0.84866749957993,0.396066335680888,NA 252 | "196913",0.258932569471319,-0.415152415296038,0.484339485121617,-0.857151704639143,0.391361049284835,NA 253 | "645687",0.240905502892244,-0.411527229678687,0.482781165174266,-0.852409454561346,0.393986874458091,NA 254 | "26499",44.3431365180018,-0.411224791104946,0.283974626048082,-1.44810399727516,0.147587967817735,0.230381218056952 255 | "58517",851.743223471815,0.406796577377911,0.123034818136252,3.30635330339916,0.000945188279238054,0.00341487378305361 256 | "55051",258.307878536129,-0.4016385902835,0.158180570373089,-2.53911456594312,0.0111133424365118,0.0266244781366699 257 | "55012",74.2766655347405,-0.401012343044333,0.241018159165683,-1.66382626285294,0.0961471540034516,0.161932048847919 258 | "100506499",0.628547888911741,-0.398354689274887,0.66168332650011,-0.602032231009861,0.547152683076694,NA 259 | "23508",2.46558891104577,0.397856173134218,0.714924422334326,0.556501024031552,0.577868394591393,0.679488296002478 260 | "57544",160.740297547376,-0.394305770640654,0.197452292879132,-1.99696729215509,0.0458287362025286,0.0862658563812303 261 | "801",2942.34974033974,-0.393921385991628,0.104204373288416,-3.78027690739365,0.000156654033355333,0.000694861454883061 262 | "122961",81.8699095689892,-0.389208641659444,0.229016890169391,-1.69947570841421,0.0892295877544899,0.152575783641265 263 | "624",361.641418947578,0.383154578414905,0.15064475956582,2.5434311788821,0.0109769678638395,0.0264391484032264 264 | 
"122553",133.794357172539,-0.380699291126736,0.200769657218314,-1.89619933809404,0.0579336906853524,0.103041443496957 265 | "100129075",31.4833332449977,0.380124028092704,0.320589916708008,1.18570175879524,0.235740099735379,0.341784998969093 266 | "57596",821.832052695162,-0.379436763106851,0.178497308526073,-2.12572820419546,0.03352589709822,0.066265238424245 267 | "9112",2186.05010257752,0.377169088027411,0.149855466850467,2.51688574300574,0.0118397217109953,0.0277706561598213 268 | "5411",1519.65466877779,0.374374091520246,0.114574533283709,3.26751574534652,0.00108495831171467,0.0037537156353444 269 | "63874",497.8177544446,-0.370941277089145,0.138362432597431,-2.68093925587741,0.00734158412491473,0.0194617141299515 270 | "8906",914.253047671338,-0.370846307691758,0.140860321136195,-2.63272371311148,0.00847032195623239,0.0215608195249552 271 | "64112",285.594077916298,-0.370613192649338,0.155687950908822,-2.38048731764981,0.0172897561803309,0.0383455978652884 272 | "91833",191.915754857527,0.370061215561318,0.165262053671827,2.23923887752342,0.025140377488638,0.0521430051616196 273 | "7517",453.254617345418,0.368297140173018,0.162984740298552,2.25970320594663,0.0238396772018087,0.0496752343553968 274 | "5684",519.235637448021,0.367767182397068,0.139799037121658,2.63068465970208,0.00852130622735119,0.0215680519200753 275 | "283551",1.37277277298017,-0.365893278777356,0.734541679163339,-0.498124598176808,0.618396223912423,NA 276 | "100616300",0.126022912962788,0.365385981793238,0.374900854954253,0.974620294845219,0.32974864280122,NA 277 | "122970",0.126022912962788,0.365385981793238,0.374900854954253,0.974620294845219,0.32974864280122,NA 278 | "319103",0.126022912962788,0.365385981793238,0.374900854954253,0.974620294845219,0.32974864280122,NA 279 | "56967",0.126022912962788,0.365385981793238,0.374900854954253,0.974620294845219,0.32974864280122,NA 280 | 
"5283",143.858465886597,-0.364913740313636,0.191735763622523,-1.90321165660077,0.0570129327512028,0.102167175490155 281 | "5018",2809.65296920578,0.364190555633631,0.103707285785242,3.51171620080578,0.000445223161699617,0.00174964891615288 282 | "55015",259.471209211921,0.36184943005149,0.165859476781818,2.18166267657705,0.0291344374366644,0.0590598550752293 283 | "85439",850.777585128837,0.361407068016596,0.120895552075515,2.98941575444272,0.0027951151539126,0.00857679170515649 284 | "100505967",0.398320434717103,-0.357703243833274,0.571224465918088,-0.626204347284676,0.531180916159589,NA 285 | "440163",0.398320434717103,-0.357703243833274,0.571224465918088,-0.626204347284676,0.531180916159589,NA 286 | "84312",82.0916750284388,0.357363755112942,0.236035657818517,1.51402444196678,0.130019640454546,0.206556024551903 287 | "55017",545.214836837902,-0.357320488909194,0.130657945464639,-2.7347781081243,0.00624223393301568,0.0170519561097014 288 | "23034",381.58724152163,0.352678030132048,0.136341376974607,2.58672780015806,0.00968921003656277,0.0241851531087087 289 | "5720",1005.81015687042,0.35258238362895,0.122566085872568,2.87667164304758,0.00401893556258595,0.0118452837634112 290 | "10972",5587.36073511244,-0.34906005676877,0.1037717420592,-3.3637293721988,0.000768969072877389,0.00284709210453777 291 | "122618",0.118581531414837,0.348927457602054,0.36696198936965,0.95085449640554,0.341678243750561,NA 292 | "83851",0.118581531414837,0.348927457602054,0.36696198936965,0.95085449640554,0.341678243750561,NA 293 | "9362",0.118581531414837,0.348927457602054,0.36696198936965,0.95085449640554,0.341678243750561,NA 294 | "641371",332.588597786726,-0.348253466983179,0.144948807807925,-2.40259628381806,0.0162791484056602,0.0366485351042 295 | "29091",2.50448591864301,-0.348138741823103,0.716235127192141,-0.486067673318275,0.626919185849236,0.716479069541984 296 | "10548",2026.383277715,-0.347923198972176,0.124589736759815,-2.7925510400821,0.00522942168400646,0.0146423807152181 
297 | "57578",3.44740527247424,-0.347192478872561,0.685031311886428,-0.506827166653839,0.612276092973096,0.706431766653722 298 | "3895",4077.9144398207,0.345812909039355,0.135475025656044,2.55259526517704,0.0106923664548349,0.0258928657933298 299 | "5529",261.171892485766,0.342789276220602,0.159939300423096,2.14324606468706,0.0320933451804049,0.0639014161814284 300 | "53349",205.457665267819,-0.342319108248156,0.169230014714872,-2.02280374923393,0.0430933849895344,0.0821524956396231 301 | "10858",0.387163594287509,-0.34128453497435,0.567808795271691,-0.60105538663072,0.547803096697344,NA 302 | "283638",1329.67069021499,0.337540045001016,0.158968432424355,2.12331492393393,0.0337274709385027,0.066265238424245 303 | "122481",1.84785934512185,-0.336254666712809,0.734586260129398,-0.457747013473377,0.647134207387507,NA 304 | "10419",2139.13743449535,0.333995498493685,0.106863809352388,3.12543133655585,0.00177544513393606,0.00580583518250626 305 | "112840",133.424955079484,-0.333279843731912,0.210155700676107,-1.5858710596938,0.112768561983092,0.185736455030975 306 | "10979",1605.59606939013,0.33260514195071,0.120750559222638,2.75448117252574,0.00587852720824391,0.0163576409272874 307 | "145508",73.9463761913324,-0.330791586181098,0.241761997337215,-1.36825303324948,0.171232886547883,0.260812538585099 308 | "55640",333.988970315349,0.328472278909357,0.144203729318107,2.27783484146074,0.0227364196150309,0.0478212018194076 309 | "26175",104.274319462524,0.326658907976218,0.218729977709926,1.49343455979969,0.135323472130859,0.213468012375439 310 | "51292",641.458908619578,-0.319241216510753,0.127061010236762,-2.51250337074991,0.0119877959877635,0.0279715239714483 311 | "51528",805.300375733519,-0.318661522541899,0.150871910443338,-2.11213287884744,0.034675053618952,0.0675409740056108 312 | "63894",262.673223524986,-0.315745806320558,0.16050965795087,-1.96714521949329,0.0491664713356335,0.0910189221419992 313 | 
"5875",502.022580865566,-0.31571868400896,0.160195921280548,-1.9708284798091,0.0487434979491961,0.0906103198391695 314 | "27004",5.5987695682621,0.312363207046138,0.618259288572245,0.505230107852455,0.613397225956022,0.706431766653722 315 | "6430",1127.53477989518,-0.309569783284862,0.120543231330579,-2.56812248906697,0.0102251018017035,0.0248959000389302 316 | "2764",278.162579539551,0.30917254063676,0.198260364168828,1.55942687754516,0.118895378943456,0.191601186211037 317 | "7187",812.893970465571,0.307223132905796,0.131403639215893,2.33801084002731,0.0193866858219949,0.042574682589479 318 | "2954",379.351814506592,-0.305686348826925,0.151436965313583,-2.01857154357217,0.0435317711800459,0.0826365825790703 319 | "83544",77.6975608237549,-0.303063963193237,0.256418921745299,-1.1819095140501,0.237241606509902,0.342852386182052 320 | "399671",15.1962776697668,0.301969615865272,0.445615493763338,0.67764613235294,0.497996088589293,0.602979047805414 321 | "2643",128.759295021216,0.295731941160414,0.187774557985559,1.57493083372433,0.115272410593056,0.187789236166143 322 | "1735",8.12959877458449,-0.29544925952192,0.553345334742433,-0.533932864292501,0.593387983518991,0.691253103856079 323 | "23357",619.09186752831,0.295440013159353,0.154281311564228,1.91494361931425,0.0554996942261681,0.100257512150497 324 | "2877",1.21412409334817,0.292986260182569,0.732772124574733,0.399832704270246,0.689279740945444,NA 325 | "2100",5.81005327996789,0.292346558159733,0.606847345452377,0.481746456255488,0.629986070019345,0.718152059462255 326 | "23224",4152.34332616615,0.290004963185434,0.127070630367674,2.28223439473242,0.0224755054746285,0.0477205045148511 327 | "64398",799.083464528065,-0.289890574513772,0.126394894150815,-2.29353073525163,0.0218174679552,0.0467666298752613 328 | "51550",208.776588999014,0.28888029088555,0.160436641726452,1.80058799397022,0.0717678429125814,0.124620130328824 329 | 
"64582",31.9129063238833,-0.287166594415887,0.382217158993499,-0.751317903079207,0.452461357233016,0.561503290970613 330 | "122945",2.83728327830191,0.287068809413578,0.727491884624262,0.394600703431687,0.693137594312509,0.766729980869146 331 | "27109",70.0454353579069,-0.28609436091265,0.243339558296034,-1.17570017351886,0.239714709266342,0.345312507238975 332 | "4323",4.79040587745618,0.279705497887444,0.635834394437104,0.43990306333627,0.660007317113008,0.737364783208547 333 | "55172",203.4891713324,0.279560653141986,0.16647089473391,1.67933652059023,0.0930864772545196,0.158565558213022 334 | "207",2032.32652122213,0.279356468078045,0.139799978705281,1.99825830207721,0.045688663654604,0.0862658563812303 335 | "22846",55.0025234757467,-0.279130114378276,0.270401218874486,-1.03228127276986,0.301940371426092,0.412568082775539 336 | "51004",349.265901130191,-0.277673769261555,0.144347366198355,-1.92364971093403,0.0543985055122671,0.0990671970304702 337 | "51527",156.900063151762,-0.277182928274608,0.185963005245897,-1.49052725787095,0.136085654853771,0.213917099559611 338 | "58533",399.086536707901,0.276385137450703,0.139580564328385,1.98011190727432,0.0476909554715122,0.0893955985407426 339 | "11177",842.456675402816,0.273282413946731,0.14194863713897,1.92522041391058,0.0542017892643689,0.0990671970304702 340 | "283",22.4869525319274,-0.270541506663342,0.393627131202274,-0.687304012396233,0.491891170894737,0.598152023813523 341 | "3831",973.30817353131,-0.270311180168495,0.119506312742922,-2.26189875634418,0.0237036592358573,0.0496226137274022 342 | "122809",324.376854511496,-0.268958086546331,0.153080576220792,-1.75697069599742,0.0789228041838935,0.135990062593786 343 | "379025",62.666364319544,0.268711542563852,0.257428957276012,1.04382795706911,0.296564994439993,0.407549440212015 344 | "29082",42.1993161033554,-0.26632031596055,0.298656812288683,-0.891726908620198,0.372539326875431,0.483761212870125 345 | 
"5890",45.4322734370428,-0.265186486781387,0.313591593972598,-0.845642842086384,0.397752018065405,0.506229841174152 346 | "10640",291.45504854957,0.260541830799317,0.180561746257476,1.44295143461778,0.149034152533451,0.231830903940923 347 | "5721",76.7400952701813,0.259818561989378,0.24240349807465,1.07184328631002,0.283790427063313,0.392401578161618 348 | "8788",0.623350477155508,-0.25979204720537,0.633996707781716,-0.409768763806918,0.681975581974887,NA 349 | "1734",941.981114108177,-0.257092167943856,0.135597550060593,-1.89599419627399,0.057960811967038,0.103041443496957 350 | "29954",1514.34557881554,0.255442136886182,0.12021975779262,2.12479330832475,0.0336038634455599,0.066265238424245 351 | "27141",427.826922792813,0.255363576012404,0.133314057799741,1.91550373776786,0.0554282929245971,0.100257512150497 352 | "9878",815.01466464375,0.253951670707181,0.122166819533647,2.07872867343688,0.0376422959611364,0.0730032406519008 353 | "171546",1059.62168396793,-0.252580377898037,0.116393766410897,-2.17005073112222,0.0300030029864114,0.0605466006212267 354 | "55147",1571.99344008633,0.250910940915595,0.108028085838491,2.32264543954543,0.0201982075398188,0.0439261989215476 355 | "7597",245.070544900869,-0.250753250007562,0.164269483274881,-1.52647494232366,0.126891626248814,0.203026601998102 356 | "647310",7.92097100229312,-0.250572434077565,0.550880221881313,-0.454858287018971,0.64921119389667,0.730096602147412 357 | "145567",66.3122284623466,0.250375910020869,0.255056549195962,0.981648621884644,0.32627298501524,0.435426059052633 358 | "9950",210.514321001507,0.246503823030538,0.176857704693947,1.39379747948846,0.163378769513036,0.250663317609042 359 | "5700",241.095835856152,0.243650082656849,0.178284953130114,1.36663290075316,0.171740399291527,0.260812538585099 360 | "91612",176.322010522042,-0.24275085805165,0.193300946194296,-1.2558182607531,0.209181869379141,0.307884745986296 361 | 
"55218",647.67921470333,-0.242427571634415,0.132697857995435,-1.82691397808965,0.0677127042389817,0.118961927447309 362 | "161253",5.8779934560187,0.240548437603009,0.600800261741193,0.400380047947833,0.688876617439522,0.763902783695312 363 | "10202",53.4125358228483,-0.23839642393,0.288982791731122,-0.824950241853193,0.409399840621614,0.518110532764077 364 | "10577",1183.45168902212,-0.23645231732306,0.120543110437632,-1.96155812194177,0.0498139519358838,0.0918380677665676 365 | "1778",33118.8415116603,-0.236336348975265,0.116020198428026,-2.03702762258142,0.0416472739933071,0.0799192316746729 366 | "3091",2789.35036249145,-0.235855338348046,0.129326292024166,-1.82372303927166,0.0681939539380789,0.119339419391638 367 | "7043",157.407024331088,0.235153313352927,0.187426740476571,1.25464121477544,0.209609034655849,0.307884745986296 368 | "114088",0.860660299765207,-0.233822299698246,0.712602530870741,-0.328124430617071,0.742817582715481,NA 369 | "5891",81.8415374666515,-0.233618513548814,0.229532060911059,-1.01780340672904,0.308771392728889,0.419180557401643 370 | "123016",226.775921080972,-0.232073414175784,0.166391556797132,-1.39474272999761,0.163093434463717,0.250663317609042 371 | "599",605.273181172259,0.231068746765875,0.131287485197002,1.7600211202092,0.0784042256525486,0.135618120047652 372 | "9147",441.831968171799,0.229901340929063,0.145538753777707,1.57965720443237,0.114185391213221,0.18669728198366 373 | "145438",1.87102942734362,-0.22894723518139,0.734181889274388,-0.311839938475825,0.75516216990453,NA 374 | "55148",504.053772523441,0.22797591878988,0.131062771618923,1.73944069680398,0.081957275001787,0.140677621459006 375 | "4738",1068.17611216943,0.225504345876595,0.116415662600374,1.93706191108232,0.0527377758617423,0.0968300146969694 376 | "100506412",0.129466284735659,-0.224564553683371,0.374514863528414,-0.599614529494724,0.548763161715977,NA 377 | 
"84659",0.129466284735659,-0.224564553683371,0.374514863528414,-0.599614529494724,0.548763161715977,NA 378 | "5265",0.127374623453615,-0.221913848739733,0.372544065187931,-0.595671410381447,0.551394760947956,NA 379 | "84439",0.127374623453615,-0.221913848739733,0.372544065187931,-0.595671410381447,0.551394760947956,NA 380 | "57447",155.170394903188,0.215671408310248,0.189230816697441,1.13972666859586,0.254400192988979,0.359530872110608 381 | "161145",1.64746279960457,0.210545927663244,0.737203146861455,0.285600961634002,0.775183766845748,NA 382 | "100616158",0.118309444306065,-0.209906736415662,0.363393776681876,-0.577628869520843,0.563514711598653,NA 383 | "100886964",0.118309444306065,-0.209906736415662,0.363393776681876,-0.577628869520843,0.563514711598653,NA 384 | "145407",0.118309444306065,-0.209906736415662,0.363393776681876,-0.577628869520843,0.563514711598653,NA 385 | "145474",0.118309444306065,-0.209906736415662,0.363393776681876,-0.577628869520843,0.563514711598653,NA 386 | "27133",0.118309444306065,-0.209906736415662,0.363393776681876,-0.577628869520843,0.563514711598653,NA 387 | "338917",0.118309444306065,-0.209906736415662,0.363393776681876,-0.577628869520843,0.563514711598653,NA 388 | "253959",201.648717330029,0.208030919542369,0.179772739363181,1.15718834946438,0.247195416626112,0.351566814757137 389 | "90141",24.4837270753047,-0.207536622482438,0.380438360033124,-0.545519706436459,0.585396130838795,0.686537870721937 390 | "78990",46.8164519399536,-0.205515825650427,0.285836954780944,-0.718996694489441,0.472142956361349,0.577400352774452 391 | "100506071",0.113530879438629,-0.203330446356014,0.358231568779085,-0.567594997417451,0.57031000478715,NA 392 | "382",737.83712827266,-0.20310917654544,0.12486006866273,-1.62669441656383,0.10380201518375,0.173519786575822 393 | "115817",616.477938378626,0.202837556612749,0.150776665213746,1.34528480468247,0.178533278750041,0.270212530000062 394 | 
"22985",3016.39039310875,0.199127317799384,0.107598175407943,1.85065701202108,0.0642189121468099,0.113268002526657 395 | "6495",66.5246849241116,-0.196945138529458,0.254656990623484,-0.773374169102023,0.439300972596241,0.546685654786434 396 | "254170",204.062390446734,-0.192642813711905,0.165360464660124,-1.16498713345936,0.244024201035843,0.350393724564288 397 | "4331",119.188578115546,0.191992191002935,0.220311920162178,0.871456210184194,0.383505105731869,0.492488777635618 398 | "1983",2444.68667469688,0.190505368999447,0.123632788380176,1.54089680816415,0.12334190279656,0.198054381551466 399 | "9517",1554.26535349477,0.179035005224858,0.118015794375216,1.51704274985125,0.129255891035637,0.206073449053258 400 | "91754",1648.87606616573,-0.178101601751243,0.110811206153334,-1.60725262303161,0.107998962570186,0.178537030374329 401 | "112752",185.146903673386,-0.1733811719961,0.17673074243173,-0.981047041451073,0.326569544289475,0.435426059052633 402 | "9786",254.146157455785,-0.172972204386026,0.170607786585387,-1.0138587918404,0.31065013695031,0.420456982941809 403 | "57680",3836.28560711167,-0.172701650914912,0.103386594154854,-1.67044530605425,0.094831292138451,0.160925829083432 404 | "85457",282.509170182675,0.168113338658457,0.148914126638365,1.12892807723149,0.25892817533593,0.362499445470301 405 | "400242",11.4799255027677,-0.167462340834447,0.484401159197771,-0.345710033212525,0.729560643272612,0.791387816431308 406 | "9895",739.002073124613,-0.167068533641863,0.138592073875687,-1.20546961287063,0.228022061943711,0.332748807005806 407 | "112487",50.9102726583708,0.16658027277362,0.298903810055551,0.557303945850208,0.577319780212899,0.679488296002478 408 | "22795",4.82374320650146,-0.165852479361822,0.640673377194918,-0.258872126211923,0.79573390531283,0.850808567017059 409 | "57697",164.260380319255,0.165150594646287,0.184429090933219,0.895469330844818,0.370536253306437,0.48255884151536 410 | 
"8650",1103.36769567791,0.163975970335274,0.115824861222939,1.41572343453669,0.156856494325419,0.242316239509614 411 | "27030",232.701123222174,0.16274404274992,0.162193785705202,1.00339259018047,0.31567147785238,0.424687153386986 412 | "26094",119.974070654714,0.161823078151544,0.192622976148723,0.840102678231912,0.400850818589996,0.508728517643961 413 | "25831",2867.30747347967,-0.161292392360617,0.107289157868018,-1.50334288725644,0.132750646098509,0.210149432693045 414 | "79711",2864.04512166258,0.158924993121608,0.119251686517127,1.33268549706241,0.182635034591888,0.275361050363322 415 | "400224",1.73273859901924,-0.157850265912551,0.735174016495475,-0.214711432083811,0.829992308934793,NA 416 | "161159",0.270945811263489,-0.156605510544756,0.501182926581317,-0.312471758790743,0.754682022553883,NA 417 | "10175",1314.19330376357,-0.15587643952223,0.117270045522131,-1.32920933754403,0.18377891530945,0.275361050363322 418 | "3705",1366.66868550646,-0.154821736973523,0.153659147276219,-1.00756602986489,0.313662843842326,0.423255885666753 419 | "4247",534.462748965834,-0.153984635171658,0.134091927752396,-1.1483512673186,0.250823582856971,0.355597990885833 420 | "11198",1990.51963834867,0.151330038893955,0.113729947102751,1.33060854022233,0.183317850943794,0.275361050363322 421 | "87",8204.7123594792,-0.15001994466855,0.135274822627486,-1.10900122990121,0.267429651411738,0.372076036746766 422 | "677811",1.71507972613353,0.149731692304368,0.736963608364107,0.203173793936364,0.838999195050402,NA 423 | "10901",212.76436235429,-0.147289830847493,0.163451780215775,-0.901120995152534,0.367523991100263,0.481434935710286 424 | "64745",985.305811169643,0.144264717482314,0.124263405274463,1.16095899000735,0.245658577399199,0.351566814757137 425 | "6617",67.4939808027702,-0.143868420837191,0.269930810323029,-0.532982584185266,0.594045636126318,0.691253103856079 426 | "84193",500.130667603109,0.1425945469339,0.1381691972525,1.03202848224783,0.302058774889234,0.412568082775539 
427 | "8106",91.419711245285,-0.140177510122818,0.219012951480193,-0.640042103334219,0.522145227344225,0.622130483644183 428 | "51339",0.255010405966459,-0.139881212499408,0.493211191687888,-0.283613216522318,0.776706807527367,NA 429 | "90809",772.281353977415,0.13892860766134,0.12501173555128,1.11132452524308,0.266428686470379,0.37183816678732 430 | "647286",0.259788970833894,-0.13743147217625,0.485834713889095,-0.282877011970006,0.777271116956159,NA 431 | "57594",275.328814476861,-0.13602730135369,0.150334561636116,-0.904830531803746,0.365555174003825,0.480260169952239 432 | "8816",708.727474955544,-0.135518750562175,0.124687489676447,-1.0868672624161,0.27709547628496,0.384330567726508 433 | "23588",516.196297186756,0.133554514156424,0.129812165917459,1.02882894844651,0.303560060953739,0.413358380873177 434 | "11154",43.9827922862683,-0.133234621565126,0.295313616606317,-0.451163150200219,0.651871966203047,0.730096602147412 435 | "256281",148.196016291302,0.130945380058728,0.197903335195332,0.661663331390977,0.508187009844328,0.610369384477906 436 | "55860",192.696379181633,-0.130847310829271,0.182340320607231,-0.717599433814325,0.473004306848714,0.577400352774452 437 | "100506433",1.35528444297627,0.129427338702992,0.735439696427098,0.175986337604257,0.860304675940042,NA 438 | "3429",7.64892592330244,-0.128937877001941,0.546872294609916,-0.235773284316611,0.813608599811959,0.863736143876203 439 | "5494",364.538884054995,0.127662058248495,0.142256272759981,0.897408991334183,0.369500716460388,0.48255884151536 440 | "101101776",25.4903279260031,0.124078231372693,0.355962406069729,0.348571167227101,0.727411276866134,0.79097148552434 441 | "414763",25.4903279260031,0.124078231372693,0.355962406069729,0.348571167227101,0.727411276866134,0.79097148552434 442 | "100302145",0.236890975720902,0.12271407934208,0.480712422765777,0.255275448543736,0.798510361126856,NA 443 | 
"64841",488.032864647589,-0.121354941704192,0.147578881608759,-0.822305606203955,0.410902986824473,0.518547994640462 444 | "26030",1248.88539383753,0.120961049614295,0.14121920710095,0.856548143113605,0.391694655309712,0.501369158796432 445 | "643382",31.8368684211581,0.119459149467213,0.331175183490832,0.360712865644174,0.718314105958838,0.78873705752343 446 | "283635",168.526110583332,-0.117308978464313,0.184328369156869,-0.636413043748459,0.524507241360299,0.623287119706668 447 | "64430",587.538388638622,-0.116951111806399,0.144336330191908,-0.81026801534237,0.417786153917509,0.52575336223327 448 | "5641",1222.23822512151,-0.115310790280814,0.124593371779799,-0.925496987790081,0.354707449831108,0.467379228012754 449 | "54207",0.245956154868452,0.11506194435816,0.485961104184167,0.236771921389318,0.812833735712259,NA 450 | "23093",538.97983645015,-0.114542255434352,0.131452999725605,-0.871355204319774,0.383560236792584,0.492488777635618 451 | "5693",3002.29692720442,-0.114301569110141,0.119935329756914,-0.953026679809936,0.340576553021612,0.450083468299948 452 | "6554",0.890852916646125,0.113774306029615,0.695949159940928,0.163480772128918,0.870139911102289,NA 453 | "51382",290.267324962901,0.113495478930463,0.156515258131433,0.725140029703401,0.468366129040498,0.575877733346636 454 | "57096",2.23860210229564,-0.112831325221111,0.730430646751621,-0.154472331798913,0.877237325381973,NA 455 | "93487",1201.03950847312,-0.108016208433205,0.112029522394436,-0.964176282506131,0.334957553133846,0.443967407704033 456 | "374569",6.74847143752185,-0.10727236010835,0.608563294812328,-0.176271492255265,0.860080657221128,0.898172807540945 457 | "6166",1274.39217432313,-0.107089685529642,0.110416999409593,-0.969865927368589,0.33211332568701,0.441503768272345 458 | "677",934.009599185288,0.10660317002246,0.145300668846361,0.733672947749338,0.46314808978441,0.573177746473524 459 | 
"6815",167.426508034665,0.105741759949831,0.176701625932274,0.598419846970506,0.549559826164749,0.651330164343406 460 | "84520",102.424327106805,0.103339794104056,0.207972527595384,0.496891562067786,0.619265522132918,0.711361420296275 461 | "23186",817.135573336417,-0.103112104476238,0.130159500999712,-0.792198062256449,0.428245198281421,0.535904605670604 462 | "4792",973.472293334456,-0.102144364077829,0.115892039188458,-0.881375155645741,0.378114800118106,0.48958216893905 463 | "2009",4.33486237140333,-0.098209686897393,0.652282890018474,-0.15056302779092,0.880320428387475,0.912924888698122 464 | "55632",103.96944592161,-0.0980592374483497,0.242409748993198,-0.404518538778328,0.685831448075064,0.762413123418433 465 | "2957",438.054205665003,-0.0979111949084367,0.14563143494875,-0.672321844132714,0.501378826272977,0.605438582669256 466 | "23405",704.60037056881,0.095921830167532,0.132524600403796,0.723803956965444,0.469186099713219,0.575877733346636 467 | "3169",370.470876131861,-0.0958226705228631,0.147952284273907,-0.64765928416127,0.517205344055783,0.617887984365308 468 | "54916",195.412037357972,-0.095189748309517,0.182867874976168,-0.52053838500569,0.602688381347239,0.69949325089006 469 | "51466",652.693608509333,0.0911461562724757,0.138897441879216,0.656211914627887,0.511687783829496,0.612930821271696 470 | "283643",382.099819729571,0.0892913587298993,0.172466254403725,0.517732347343021,0.604645025907804,0.699950831025055 471 | "55334",1810.10842247537,0.0875212722139334,0.109385809636977,0.800115412633443,0.423643932123837,0.531631601096579 472 | "64403",1334.11510565657,0.0867598515087949,0.147326457340707,0.588895253947185,0.555931541704819,0.657143352727596 473 | "56252",1279.79272180009,-0.0821661243693046,0.1127163385,-0.728963746185784,0.466023839265818,0.575147878763323 474 | "1152",13757.2273555563,-0.0791357679730935,0.160172181395234,-0.494066867815338,0.621258955973696,0.711826118353493 475 | 
"406986",0.141479526527829,0.0778722612282298,0.38108842408569,0.204341712596129,0.838086475643729,NA 476 | "442910",0.141479526527829,0.0778722612282298,0.38108842408569,0.204341712596129,0.838086475643729,NA 477 | "6252",0.141479526527829,0.0778722612282298,0.38108842408569,0.204341712596129,0.838086475643729,NA 478 | "9623",0.141479526527829,0.0778722612282298,0.38108842408569,0.204341712596129,0.838086475643729,NA 479 | "112849",157.715515478218,0.0763780926494963,0.217529233364138,0.351116452112169,0.725500985749109,0.79097148552434 480 | "6235",1039.37062427954,0.0762277090563349,0.355844408570334,0.214216402507469,0.830378305612586,0.875316425681032 481 | "3714",4388.0410852243,0.0759439589154301,0.179796891416996,0.422387496896684,0.672742202525814,0.749722653561106 482 | "122773",0.728507111444157,-0.0754987972277864,0.692294548961066,-0.109055888625858,0.913158161916667,NA 483 | "23116",137.148299137262,-0.0751511307614764,0.192096833698781,-0.391214833240394,0.695638444958267,0.767601042712571 484 | "10914",1238.80820484706,-0.0736605166307863,0.136560120851599,-0.539399908051003,0.589610944480027,0.689675465083687 485 | "957",271.04290877112,0.0725375280553122,0.151677291025493,0.478235914980316,0.632482302826034,0.719167694583917 486 | "145501",0.127481658087755,0.0723380870306647,0.368577811677791,0.196262728625407,0.844404532273136,NA 487 | "2003",0.127481658087755,0.0723380870306647,0.368577811677791,0.196262728625407,0.844404532273136,NA 488 | "23368",297.253914660692,-0.0708978242166742,0.150943147620043,-0.469698859037573,0.638570183144849,0.724251752022512 489 | "145226",1.59843573157565,-0.0683420333574104,0.736804916536776,-0.0927545837759069,0.926098532294316,NA 490 | "8110",2.7714974915651,-0.0681216907980659,0.713646181846261,-0.0954558330597797,0.923952770121433,0.949382662877068 491 | "22890",514.116693111494,-0.0642310628035167,0.140249768018329,-0.457976249879591,0.646969504392549,0.730081455838443 492 | 
"54930",357.222340031938,0.0638358078625714,0.138284506431031,0.461626609589913,0.644349108931689,0.728960608084335 493 | "89874",33.6059994970774,-0.061923159058826,0.321684681112035,-0.192496449768025,0.847353348585654,0.889026464089866 494 | "6400",1767.58151093855,-0.0512132728996478,0.113508868582698,-0.451183009214262,0.651857654349898,0.730096602147412 495 | "145282",41.2473834574326,-0.050519464428872,0.303996621789149,-0.1661842955081,0.868011917964965,0.904347300577452 496 | "5926",138.622912226507,0.0471469099063212,0.192390279169804,0.245058690645743,0.806410987880845,0.860171720406235 497 | "29966",411.282379091423,0.0447827179887914,0.142959049194105,0.3132555668301,0.754086504961168,0.816016314547351 498 | "5663",1245.51865050415,0.042911138441329,0.113021009751444,0.379673996327757,0.704187425192626,0.775125224782055 499 | "26277",274.991297105689,0.0424405234986663,0.150748435045623,0.281532100056772,0.778302310994315,0.838171719532339 500 | "4522",1684.45118501255,-0.039537485091443,0.112956208340768,-0.350024896127584,0.726320013667064,0.79097148552434 501 | "4293",408.170124757446,-0.0392248853435439,0.14149146499447,-0.277224391909979,0.78160781130757,0.839712948359211 502 | "283579",1.19047521939441,-0.0378498392470862,0.728995031513934,-0.0519205723096383,0.958591982049985,NA 503 | "23241",695.422830300979,-0.0376696410013677,0.175133961563007,-0.21509044085556,0.829696806586202,0.875316425681032 504 | "9985",33.9823818516358,-0.0362674865377648,0.346546951723306,-0.104653889920007,0.916650451473542,0.946219820875914 505 | "5836",1031.59281056639,0.0354564450877206,0.11949087944086,0.296729300626405,0.766673185315384,0.827637559087451 506 | "51241",98.7652859054145,0.0346113753774838,0.227168136897131,0.152360167452344,0.878902872004628,0.912924888698122 507 | "23002",148.616912758957,-0.0343398316562254,0.190819962156977,-0.179959325366472,0.857184500463533,0.897239850952483 508 | 
"8892",683.894155651366,0.0329723115337303,0.124520274261662,0.264794723021921,0.791167602258127,0.847949966056558 509 | "161424",514.721182351494,-0.0326572494410727,0.150158699586574,-0.217484897851317,0.827830475117279,0.875316425681032 510 | "80344",941.694825253544,-0.0288253345735764,0.121909360417866,-0.236448903306296,0.813084352418668,0.863736143876203 511 | "79882",720.107342908215,0.0267238490637471,0.131675952510957,0.202951629011556,0.839172839779231,0.882510404274872 512 | "319085",0.656231813537233,-0.0256666741536063,0.637018657909253,-0.040291871886212,0.967860433813788,NA 513 | "57161",1.34982768807623,-0.0231911700640487,0.735795424120321,-0.0315185027030779,0.974856036451835,NA 514 | "56339",603.374717737829,0.0193340263871433,0.131666762546076,0.146840599808759,0.883257837994181,0.913855684575966 515 | "161142",157.58882347319,-0.0191748144263907,0.188331338482142,-0.101814252375257,0.918904112681231,0.94636561489929 516 | "85495",13.7772996143032,0.0184393380656492,0.476394280914595,0.0387060441411027,0.969124754494401,0.983126079997857 517 | "283629",33.6562757967114,0.0113643678999274,0.320412787147166,0.0354678975240389,0.971706644326144,0.983126079997857 518 | "5826",338.203919945445,-0.0107342510807143,0.161561773522884,-0.0664405375519961,0.947027097184921,0.966442231295774 519 | "29933",0.496501735224831,0.0101667801744459,0.620986021075184,0.016371995229205,0.986937621324533,NA 520 | "122402",655.294326289855,0.0098108257143233,0.123596835258241,0.0793776450167577,0.936732249218829,0.960311321853628 521 | "5106",1825.27637627374,0.00980207444482611,0.137940211838238,0.0710603116683696,0.943349754984168,0.964887420623076 522 | "100750247",246.098352190864,0.00620993441760628,0.216051232319948,0.0287428789501652,0.977069658024916,0.985872087376492 523 | "79178",173.982566297219,-0.00598212821225816,0.171372031002736,-0.0349072609880117,0.972153690712166,0.983126079997857 524 | 
"1603",1084.93813710963,0.00447564311722734,0.120648184087738,0.0370966471735086,0.970407945358049,0.983126079997857 525 | "641455",32.8667883570347,-0.00289545755851296,0.360099448921897,-0.0080407164386996,0.993584505626131,0.9980400415258 526 | "8748",0.486968692394836,0.00154223521672577,0.618731795301086,0.0024925746962386,0.998011215192587,NA 527 | "7528",927.235426959943,-0.00152172960380106,0.122980409827027,-0.0123737561611755,0.990127422931953,0.996802439266326 528 | "122704",183.118502671239,0.000740079898197707,0.173435124654353,0.00426718578299897,0.996595288678331,0.998824808339804 529 | "123096",1091.25477087083,0.000115919740109735,0.152912741355643,0.000758077705507417,0.99939514156082,0.99939514156082 530 | -------------------------------------------------------------------------------- /inst/extdata/csaw-data-filtered.Rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/inst/extdata/csaw-data-filtered.Rds -------------------------------------------------------------------------------- /inst/extdata/csaw-normfacs.Rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/inst/extdata/csaw-normfacs.Rds -------------------------------------------------------------------------------- /inst/script/ChIPSeq/NFYA/csaw.R: -------------------------------------------------------------------------------- 1 | ## source("preprocess_NFYA.R", echo=TRUE) 2 | source("setup.R") 3 | 4 | library(csaw) 5 | library(edgeR) 6 | 7 | ## ext: fragment length 8 | system.time({ 9 | data <- windowCounts(files$bam, width=10, ext=110) 10 | }) # 156 seconds 11 | acc <- sub(".fastq.*", "", data$bam.files) 12 | colData(data) <- cbind(files[acc,], colData(data)) 13 | 14 | ## filter low count windows 15 | keep <- 
aveLogCPM(assay(data)) >= -1 16 | data <- data[keep,] 17 | 18 | ## normalization factor 19 | system.time({ 20 | binned <- windowCounts(files$bam, bin=TRUE, width=10000) 21 | }) # 139 seconds 22 | normfacs <- normalize(binned) 23 | 24 | ## Experimental design 25 | design <- model.matrix(~Treatment, colData(data)) 26 | 27 | ## DB windows 28 | y <- asDGEList(data, norm.factors=normfacs) 29 | y <- estimateDisp(y, design) 30 | fit <- glmQLFit(y, design, robust=TRUE) 31 | results <- glmQLFTest(fit, contrast=c(0, 1)) 32 | 33 | ## Correct for multiple testing 34 | merged <- mergeWindows(rowRanges(data), tol=1000L) 35 | tabcom <- combineTests(merged$id, results$table) 36 | -------------------------------------------------------------------------------- /inst/script/ChIPSeq/NFYA/preprocess_NFYA.R: -------------------------------------------------------------------------------- 1 | ## SystemRequirements: ascp; fastq-dump 2 | 3 | source("setup.R") 4 | library(SRAdb) 5 | library(Rsubread) 6 | library(Rsamtools) 7 | library(BiocParallel) 8 | 9 | sradb <- "SRAmetadb.sqlite" 10 | key <- "/app/aspera-connect/3.1.1/etc/asperaweb_id_dsa.openssh" 11 | cmd = sprintf("ascp -TT -l300m -i %s", key) 12 | source("setup.R") 13 | 14 | if (!file.exists(sradb)) 15 | getSRAdbFile() 16 | con = dbConnect(dbDriver("SQLite"), sradb) 17 | 18 | accs <- rownames(files)[!file.exists(files$sra)] 19 | for (acc in accs) 20 | sraFiles = ascpSRA(acc, con, cmd, fileType="sra", destDir=getwd()) 21 | 22 | sras <- files$sra[!file.exists(files$fastq)] 23 | bplapply(sras, function(sra) system(sprintf("fastq-dump --gzip %s", sra))) 24 | 25 | 26 | fastqs <- files$fastq[!file.exists(files$bam)] 27 | if (length(fastqs)) 28 | Rsubread::align("../mm10/mm10.Rsubread.index", fastqs, 29 | nthreads=parallel::detectCores() / 2L) 30 | 31 | bams <- files$bam[!file.exists(sprintf("%s.bai", files$bam))] 32 | bams_sorted <- sub(".BAM", ".sorted.bam", bams) 33 | sorted <- bpmapply(sortBam, bams, bams_sorted) 34 | ## oops! 
didn't mean to do the next line 35 | file.rename(sorted, names(sorted)) 36 | bplapply(sorted, indexBam) 37 | -------------------------------------------------------------------------------- /inst/script/ChIPSeq/NFYA/setup.R: -------------------------------------------------------------------------------- 1 | files <- local({ 2 | acc <- c(es_1="SRR074398", es_2="SRR074399", tn_1="SRR074417", 3 | tn_2="SRR074418") 4 | data.frame(Treatment=sub("_.*", "", names(acc)), 5 | Replicate=sub(".*_", "", names(acc)), 6 | sra=sprintf("%s.sra", acc), 7 | fastq=sprintf("%s.fastq.gz", acc), 8 | bam=sprintf("%s.fastq.gz.subread.BAM", acc), 9 | row.names=acc, stringsAsFactors=FALSE) 10 | }) 11 | -------------------------------------------------------------------------------- /vignettes/A_Introduction.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "A. Introduction: Use R / Bioconductor for Sequence Analysis" 3 | author: Martin Morgan (mtmorgan@fredhutch.org) 4 | date: "`r Sys.Date()`" 5 | output: 6 | BiocStyle::html_document: 7 | toc: true 8 | vignette: > 9 | %\VignetteIndexEntry{A. Introduction} 10 | %\VignetteEngine{knitr::rmarkdown} 11 | \usepackage[utf8]{inputenc} 12 | --- 13 | 14 | ```{r style, echo=FALSE, results='asis'} 15 | BiocStyle::markdown() 16 | ``` 17 | 18 | # Introduction 19 | 20 | This **INTERMEDIATE** course is designed for individuals comfortable 21 | using _R_, and with some familiarity with _Bioconductor_. It consists 22 | of approximately equal parts lecture and practical sessions addressing 23 | use of _Bioconductor_ software for analysis and comprehension of 24 | high-throughput sequence and related data. 
Specific topics include use 25 | of central Bioconductor classes (e.g., _GRanges_, 26 | _SummarizedExperiment_), RNASeq gene differential expression, ChIP-seq 27 | and methylation work flows, approaches to management and integrative 28 | analysis of diverse high-throughput data types, and strategies for 29 | working with large data. Participants are required to bring a laptop 30 | with wireless internet access and a modern version of the Chrome or 31 | Safari web browser. 32 | 33 | # Schedule (tentative) 34 | 35 | Day 1 (9:00 - 12:30; 1:30 - 5:00) 36 | 37 | - [B. Genomic Ranges](B_GenomicRanges.html). Working with Genomic 38 | Ranges and other _Bioconductor_ data structures (e.g., in the 39 | [GenomicRanges](http://bioconductor.org/packages/devel/bioc/html/GenomicRanges.html). 40 | package). 41 | - [C. Differential Gene Expression](C_DifferentialExpression.html). RNA-Seq 42 | known gene differential expression with 43 | [DESeq2](http://bioconductor.org/packages/devel/bioc/html/DESeq2.html) 44 | and 45 | [edgeR](http://bioconductor.org/packages/devel/bioc/html/edgeR.html). 46 | 47 | Day 2 (9:00 - 12:30; 1:30 - 5:00) 48 | 49 | - [D. Machine Learning](D_MachineLearning.html). 50 | - [E. Gene Set Enrichment](E_GeneSetEnrichment.html). 51 | - [F. ChIP-seq](F_ChIPSeq.html) ChIP-seq with 52 | [csaw](http://bioconductor.org/packages/devel/bioc/html/csaw.html) 53 | - [I. Large Data](I_LargeData.html) -- efficient, parallel, and cloud 54 | programming with 55 | [BiocParallel](http://bioconductor.org/packages/devel/bioc/html/BiocParallel.html), 56 | [GenomicFiles](http://bioconductor.org/packages/devel/bioc/html/GenomicFiles.html), 57 | and other resources. 58 | 59 | # Bioconductor 60 | 61 | Analysis and comprehension of high throughput genomic data 62 | 63 | - Sequencing, microarrays, proteomics, flow cytometry, imaging, ... 64 | - Recent publication: Huber et al., Orchestrating high-throughput 65 | genomic analysis with Bioconductor. 
Nature Methods 66 | 12:[115-121](http://www.nature.com/nmeth/journal/v12/n2/abs/nmeth.3252.html) 67 | 68 | Packages 69 | 70 | - Software 71 | - Annotation -- static, versioned annotations, e.g., genome sequences, 72 | gene models, gene identifiers 73 | - Experiment data -- example experiments used to illustrate work 74 | flows and analysis 75 | - Recently: `r Biocpkg("AnnotationHub")` for easy web-based retrieval 76 | of genome-scale annotation 77 | 78 | # Sequencing work flows & ecosystem 79 | 80 | 0. Research question 81 | 1. Experimental Design 82 | 2. Wet lab 83 | 3. Sequencing 84 | 4. Alignment 85 | 5. _Reduction_ 86 | 6. _Analysis_ 87 | 7. _Comprehension_ (annotation, integration, visualization) 88 | 89 | ![](our_figures/SequencingEcosystem.png) 90 | -------------------------------------------------------------------------------- /vignettes/B_GenomicRanges.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: B. Genomic Ranges 3 | author: Martin Morgan (mtmorgan@fredhutch.org) 4 | date: "`r Sys.Date()`" 5 | output: 6 | BiocStyle::html_document: 7 | toc: true 8 | vignette: > 9 | %\VignetteIndexEntry{B. Genomic Ranges} 10 | %\VignetteEngine{knitr::rmarkdown} 11 | \usepackage[utf8]{inputenc} 12 | --- 13 | 14 | ```{r style, echo=FALSE, results='asis'} 15 | BiocStyle::markdown() 16 | suppressPackageStartupMessages({ 17 | library(GenomicAlignments) 18 | library(BSgenome.Hsapiens.UCSC.hg19) 19 | library(TxDb.Hsapiens.UCSC.hg19.knownGene) 20 | }) 21 | ``` 22 | 23 | # Motivation 24 | 25 | Genomic ranges describe... 26 | 27 | - Annotations, e.g., exons, genes, binding sites, .. 
28 | - Data, e.g., aligned reads, called peaks, copy number regions 29 | - See Lawrence et al., Software for computing and annotating genomic 30 | ranges, PLoS Comput Biol 9(8): 31 | [e1003118](https://doi.org/10.1371/journal.pcbi.1003118) 32 | 33 | Packages 34 | 35 | - `r Biocpkg("GenomicRanges")` for essential genomic ranges; depends 36 | on `r Biocpkg("IRanges")`, `r Biocpkg("S4Vectors")` and other 37 | packages 38 | - `r Biocpkg("GenomicAlignments")` for aligned reads using genomic 39 | range-based concepts. Depends on `r Biocpkg("Rsamtools")`, `r 40 | Biocpkg("Biostrings")` and other packages 41 | 42 | ```{r packages} 43 | library(GenomicRanges) 44 | library(GenomicAlignments) 45 | sessionInfo() 46 | ``` 47 | 48 | ## Use cases 49 | 50 | `GRanges`: simple genomic ranges 51 | 52 | - e.g., exons, binding sites 53 | - e.g., called peaks, reads aligned without gaps 54 | - _vector_ of genomic ranges 55 | - _metadata_ `mcols()` of associated data, e.g., 'score', 'id', ... 56 | 57 | ![](our_figures/GRanges.png) 58 | 59 | - `seqnames()`, e.g., chromosome, but could be, e.g., contig, ... 60 | - `start()`, `end()`: 1-based, closed intervals 61 | - `strand()`: +, -, or * (does not matter) 62 | - `mcols()` 63 | 64 | `GRangesList`: nested genomic ranges 65 | 66 | - e.g., exons-within-transcripts 67 | - e.g., aligned reads with gaps; paired-end reads 68 | 69 | ![](our_figures/GRangesList.png) 70 | 71 | - Accessors return `*List` objects: lists, but all elements of the 72 | same type. E.g., `start()` returns an `IntegerList()`. 
73 | - `unlist()` 74 | 75 | # Range-based operations 76 | 77 | ![](our_figures/RangeOperations.png) 78 | 79 | Intra-range operations 80 | 81 | - e.g., `shift()`, `flank()` 82 | 83 | Inter-range operations 84 | 85 | - e.g., `range()`, `reduce()`, `disjoin()` 86 | 87 | Between-object 88 | 89 | - e.g., `psetdiff()`, `findOverlaps()`, `countOverlaps()` 90 | 91 | ![](our_figures/journal.pcbi.1003118.t001.png) 92 | 93 | PLoS Comput Biol 9(8): 94 | [e1003118](https://doi.org/10.1371/journal.pcbi.1003118) 95 | 96 | # Working with _Bioconductor_ classes and methods 97 | 98 | What can I do with my `GRanges` instance? 99 | 100 | ```{r methods-class} 101 | methods(class="GRanges") 102 | ``` 103 | 104 | What type of object(s) can I use `findOverlaps()` on (what _methods_ 105 | exist for the `findOverlaps()` _generic_)? 106 | 107 | ```{r methods-generic} 108 | methods(findOverlaps) 109 | ``` 110 | 111 | How can I get help on functions, generics, and methods? 112 | 113 | ```{r help, eval=FALSE} 114 | ?"findOverlaps" ## generic 115 | ?"findOverlaps," ## specific method 116 | ``` 117 | 118 | Other help? 119 | 120 | - Vignettes, e.g., `browseVignettes("GenomicRanges")` 121 | - Work flows, videos, training material on the Bioconductor 122 | [web site](http://bioconductor.org/) 123 | - Questions and answers on the Bioconductor 124 | [support site](https://support.bioconductor.org) 125 | 126 | # Important parts of the sequence class menagerie 127 | 128 | `GAlignments` and friends (`r Biocpkg("GenomicAlignments")`) 129 | 130 | - `GAlignments`: Single-end aligned reads, e.g., from BAM files 131 | - `GAlignmentPairs`, `GAlignmentsList`: paired-end aligned 132 | reads. 
`*Pairs` is more restrictive on what pairs can be represented 133 | 134 | `DNAString` and `DNAStringSet` (`r Biocpkg("Biostrings")`) 135 | 136 | `SummarizedExperiment` (`r Biocpkg("GenomicRanges")`) 137 | 138 | ![](our_figures/nmeth.3252-F2.jpg) 139 | 140 | - `assays` of rows (regions of interest; genomic ranges) x columns 141 | (samples, including integrated phenotypic information) 142 | 143 | `TxDb` (`r Biocpkg("AnnotationDb")`) 144 | 145 | - Coordinating transcript-level descriptions derived, e.g., from GTF, 146 | UCSC, Biomart 147 | - `transcripts()` interface 148 | - `select()` interface 149 | 150 | `VCF` (`r Biocpkg("VariantAnnotation")`) 151 | 152 | Lower-level classes 153 | 154 | - `DataFrame` (`r Biocpkg("S4Vectors")`) (like a `data.frame`, but can 155 | contain S4 objects) 156 | - `IRanges` (`r Biocpkg("IRanges")`), `Rle` (`r Biocpkg("S4Vectors")`) 157 | 158 | ![](our_figures/FilesToPackages.png) 159 | 160 | # Deeper understanding 161 | 162 | ## Classes and class hierarchies 163 | 164 | - 'Ideally', the user can remain ignorant of how classes are 165 | implemented. Actually, it sometimes helps! 166 | - Slots 167 | - Contained classes 168 | 169 | _R_ works efficiently on vectors 170 | 171 | - Represent `GRanges` as a collection of vectors, not as a collection 172 | of records 173 | 174 | ```{r getClass-GRanges} 175 | getClass("GRanges") 176 | ``` 177 | 178 | ![](our_figures/GRangesImplementation.png) 179 | 180 | ## `Vector` and `Annotated` 181 | 182 | - `[`, `length()`, `names()`, etc. 183 | - `mcols()` 184 | 185 | ## `List`-like 186 | 187 | - `[[` 188 | - `elementLengths()` 189 | 190 | Implementation: `Vector` plus partitioning 191 | 192 | - Consequence: `unlist()` and `relist()` are very inexpensive 193 | 194 | ![](our_figures/GRangesListImplementation.png) 195 | 196 | # Practical 197 | 198 | ## 1. 
Exon and transcript characterization 199 | 200 | Ingredients 201 | 202 | - `r Biocpkg("TxDb.Hsapiens.UCSC.hg19.knownGene")` TxDb package 203 | - `exons()` and `exonsBy()` functions 204 | - `width()`, `elementLengths()` accessors 205 | - `hist()` 206 | 207 | Goals 208 | 209 | - Histogram of exon widths 210 | - Histogram of transcript widths 211 | - Transcripts with exactly one exon 212 | - Transcripts with maximum number of exons 213 | 214 | ## 2. GC content 215 | 216 | Ingredients 217 | - `r Biocpkg("BSgenome.Hsapiens.UCSC.hg19")` BSgenome package 218 | - `r Biocpkg("TxDb.Hsapiens.UCSC.hg19.knownGene")` TxDb package 219 | - `?"getSeq,BSgenome-method"`, `letterFrequency()` 220 | 221 | Goals 222 | 223 | - GC content of exons, transcripts 224 | - GC content of non-coding regions 225 | 226 | ## 3. CpG islands 227 | 228 | Ingredients 229 | 230 | - `DNAStringSet()` to construct CG island sequence 231 | - `matchPDict()` to find CG islands on BSgenome chromosomes 232 | - ?? `coverage()`, `tileGenome()`, `Views()`, following 233 | [Hints](https://support.bioconductor.org/p/65281/) 234 | - ?? `tileGenome()`, `findOverlaps()`, `splitAsList()`, `mean()` 235 | 236 | Goal 237 | 238 | - Average number of CpG islands in 'tiles' across chr 17 239 | 240 | ## 4. Aligned reads 241 | 242 | Ingredients 243 | 244 | - `r Biocexptpkg("RNAseqData.HNRNPC.bam.chr14")` and 245 | `RNAseqData.HNRNPC.bam.chr14_BAMFILES` 246 | - `readGAlignments()`, `readGAlignmentPairs()`, and 247 | `readGAlignmentsList()` 248 | - `BamFile()` 249 | 250 | Goals 251 | 252 | - Compare three input methods 253 | - Use `ScanBamParam()` `which` and `what` to selectively input data 254 | - `countOverlaps()` between reads and known genes 255 | - Use `BamFile()` `yieldSize` argument to iterate through file 256 | 257 | -------------------------------------------------------------------------------- /vignettes/C_DifferentialExpression.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: C. 
Differential Expression 3 | author: Martin Morgan (mtmorgan@fredhutch.org) 4 | date: "`r Sys.Date()`" 5 | output: 6 | BiocStyle::html_document: 7 | toc: true 8 | vignette: > 9 | %\VignetteIndexEntry{C. Differential Expression} 10 | %\VignetteEngine{knitr::rmarkdown} 11 | \usepackage[utf8]{inputenc} 12 | --- 13 | 14 | ```{r style, echo=FALSE, results='asis'} 15 | BiocStyle::markdown() 16 | options(max.print=1000) 17 | suppressPackageStartupMessages({ 18 | library(genefilter) 19 | library(airway) 20 | library(DESeq2) 21 | library(GenomicAlignments) 22 | library(GenomicFeatures) 23 | }) 24 | ``` 25 | 26 | # Work flow 27 | 28 | ## 1. Experimental design 29 | 30 | Keep it simple 31 | 32 | - Classical experimental designs 33 | - Time series 34 | - Without missing values, where possible 35 | - Intended analysis must be feasible -- can the available samples and 36 | hypothesis of interest be combined to formulate a testable 37 | statistical hypothesis? 38 | 39 | Replicate 40 | 41 | - Extent of replication determines nuance of biological question. 42 | - No replication (1 sample per treatment): qualitative description 43 | with limited statistical options. 44 | - 3-5 replicates per treatment: designed experimental manipulation 45 | with cell lines or other well-defined entities; 2-fold (?) 46 | change in average expression between groups. 47 | - 10-50 replicates per treatment: population studies, e.g., cancer 48 | cell lines. 49 | - 1000's of replicates: prospective studies, e.g., SNP discovery 50 | - One resource: `r Biocpkg("RNASeqPower")` 51 | 52 | Avoid confounding experimental factors with other factors 53 | 54 | - Common problems: samples from one treatment all on the same flow 55 | cell; samples from treatment 1 processed first, treatment 2 56 | processed second, etc. 
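
One way to check a planned layout for this kind of confounding is to cross-tabulate treatment against flow cell. The sketch below is illustrative only: the sample sheet, the flow cell labels `FC1` / `FC2`, and the column names are hypothetical and not part of the course material.

```r
## Hypothetical sample sheet: 2 treatments x 4 replicates, 2 flow cells
set.seed(123)
samples <- data.frame(
    sample = sprintf("s%d", 1:8),
    treatment = rep(c("control", "treated"), each = 4),
    stringsAsFactors = FALSE
)

## Confounded layout: each treatment sits entirely on one flow cell
samples$flowcell_confounded <- rep(c("FC1", "FC2"), each = 4)

## Blocked randomization: within each treatment, half of the samples
## go to each flow cell, in random order
flowcell <- character(nrow(samples))
for (trt in unique(samples$treatment)) {
    idx <- which(samples$treatment == trt)
    flowcell[idx] <- sample(rep(c("FC1", "FC2"), length.out = length(idx)))
}
samples$flowcell_blocked <- flowcell

## Cross-tabulate treatment against flow cell to check for confounding
confounded <- with(samples, table(treatment, flowcell_confounded))
blocked <- with(samples, table(treatment, flowcell_blocked))
confounded
blocked
```

A table with empty cells, as in the confounded layout, means flow cell and treatment cannot be separated statistically. The blocked layout puts every treatment on every flow cell, so flow cell can be recorded and included as a covariate.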
57 | 58 | Record covariates 59 | 60 | Be aware of _batch effects_ 61 | 62 | - Known 63 | 64 | - Phenotypic covariates, e.g., age, gender 65 | - Experimental covariates, e.g., lab or date of processing 66 | - Incorporate into linear model, at least approximately 67 | 68 | - Unknown 69 | 70 | - Or just unexpected / undetected 71 | - Characterize using, e.g., `r Biocpkg("sva")`. 72 | 73 | - Surrogate variable analysis 74 | 75 | - Leek et al., 2010, Nature Reviews Genetics 11 76 | [733-739](http://www.nature.com/nrg/journal/v11/n10/abs/nrg2825.html), 77 | Leek & Storey, PLoS Genet 3(9): 78 | [e161](https://doi.org/10.1371/journal.pgen.0030161). 79 | - Scientific finding: pervasive batch effects 80 | - Statistical insights: surrogate variable analysis: identify and 81 | build surrogate variables; remove known batch effects 82 | - Benefits: reduce dependence, stabilize error rate estimates, and 83 | improve reproducibility 84 | - _ComBat_ software / `r Biocpkg("sva")` _Bioconductor_ package 85 | 86 | ![](our_figures/nrg2825-f2.jpg) 87 | HapMap samples from one facility, ordered by date of processing. 88 | 89 | ## 2. Wet-lab 90 | 91 | Confounding factors 92 | 93 | - Record or avoid 94 | 95 | Artifacts of your _particular_ protocols 96 | 97 | - Sequence contaminants 98 | - Enrichment bias, e.g., non-uniform transcript representation. 99 | - PCR artifacts -- adapter contaminants, sequence-specific 100 | amplification bias, ... 101 | 102 | ## 3. 
Sequencing 103 | 104 | Axes of variation 105 | 106 | - Single- versus paired-end 107 | - Length: 50-200nt 108 | - Number of reads per sample 109 | 110 | Application-specific, e.g., 111 | 112 | - ChIP-seq: short, single-end reads are usually sufficient 113 | - RNA-seq, known genes: single- or paired-end reads 114 | - RNA-seq, transcripts or novel variants: paired-end reads 115 | - Copy number: single- or paired-end reads 116 | - Structural variants: paired-end reads 117 | - Variants: depth via longer, paired-end reads 118 | - Microbiome: long paired-end reads (overlapping ends) 119 | 120 | ## 4. Alignment 121 | 122 | Alignment strategies 123 | 124 | - _de novo_ 125 | - No reference genome; considerable sequencing and computational 126 | resources 127 | - Genome 128 | - Established reference genome 129 | - Splice-aware aligners 130 | - Novel transcript discovery 131 | - Transcriptome 132 | - Established reference genome; reliable gene model 133 | - Simple aligners 134 | - Known gene / transcript expression 135 | 136 | Splice-aware aligners (and _Bioconductor_ wrappers) 137 | 138 | - [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2) (`r Biocpkg("Rbowtie")`) 139 | - [STAR](https://github.com/alexdobin/STAR) 140 | ([doi](https://doi.org/10.1093/bioinformatics/bts635)) 141 | - [subread](https://doi.org/10.1093/nar/gkt214) (`r Biocpkg("Rsubread")`) 142 | - Systematic evaluation (Engstrom et al., 2013, 143 | [doi](https://doi.org/10.1038/nmeth.2722)) 144 | 145 | ## (5a. Bowtie2 / tophat / Cufflinks / Cuffdiff / etc) 146 | 147 | - [tophat](http://ccb.jhu.edu/software/tophat) uses Bowtie2 to perform 148 | basic single- and paired-end alignments, then uses algorithms to 149 | place difficult-to-align reads near to their well-aligned mates. 150 | - [Cufflinks](http://cole-trapnell-lab.github.io/cufflinks/) 151 | ([doi](https://doi.org/10.1038/nprot.2012.016)) takes _tophat_ 152 | output and estimates existing and novel transcript abundance. 
153 | [How Cufflinks Works](http://cufflinks.cbcb.umd.edu/howitworks.html) 154 | - [Cuffdiff](http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/) 155 | assesses statistical significance of estimated abundances between 156 | experimental groups 157 | - [RSEM](http://www.biomedcentral.com/1471-2105/12/323) includes de 158 | novo assembly and quantification 159 | 160 | ## 5. Reduction to 'count tables' 161 | 162 | - Use known gene model to count aligned reads overlapping regions of 163 | interest / gene models 164 | - Gene model can be public (e.g., UCSC, NCBI, ENSEMBL) or _ad hoc_ (gff file) 165 | - `GenomicAlignments::summarizeOverlaps()` 166 | - `Rsubread::featureCounts()` 167 | - [HTSeq](http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html), 168 | [htseq-count](http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) 169 | 170 | ## 6. Analysis 171 | 172 | Unique statistical aspects 173 | 174 | - Large data, few samples 175 | - Comparison of each gene, across samples; _univariate_ measures 176 | - Each gene is analyzed by the _same_ experimental design, under the 177 | _same_ null hypothesis 178 | 179 | Summarization 180 | 181 | - Counts _per se_, rather than a summary (RPKM, FPKM, ...), are 182 | relevant for analysis 183 | - For a given gene, larger counts imply more information; RPKM etc., 184 | treat all estimates as equally informative. 185 | - Comparison is across samples at _each_ region of interest; all 186 | samples have the same region of interest, so modulo library size 187 | there is no need to correct for, e.g., gene length or mappability. 188 | 189 | Normalization 190 | 191 | - Libraries differ in size (total counted reads per sample) for 192 | uninteresting reasons; we need to account for differences in 193 | library size in statistical analysis. 194 | - Total number of counted reads per sample is _not_ a good estimate of 195 | library size. 
It is unnecessarily influenced by regions with large 196 | counts, and can introduce bias and correlation across 197 | genes. Instead, use a robust measure of library size that takes 198 | account of skew in the distribution of counts (simplest: trimmed 199 | geometric mean; more advanced / appropriate measures are encountered in the lab). 200 | - Library size (total number of counted reads) differs between 201 | samples, and should be included _as a statistical offset_ in 202 | analysis of differential expression, rather than 'dividing by' the 203 | library size early in an analysis. 204 | 205 | Appropriate error model 206 | 207 | - Count data is _not_ distributed normally or as a Poisson process, 208 | but rather as negative binomial. 209 | - Result of a combination of Poisson ('shot' noise, i.e., within-sample 210 | technical and sampling variation in read counts) with variation 211 | between biological samples. 212 | - A negative binomial model requires estimation of an additional 213 | parameter ('dispersion'), which is estimated poorly in small 214 | samples. 215 | - Basic strategy is to moderate per-gene estimates with more robust 216 | local estimates derived from genes with similar expression values (a 217 | little more on borrowing information is provided below). 218 | 219 | Pre-filtering 220 | 221 | - Naively, a statistical test (e.g., t-test) could be applied to each 222 | row of a counts table. However, we have relatively few samples 223 | (10's) and very many comparisons (10,000's) so a naive approach is 224 | likely to be very underpowered, resulting in a very high _false 225 | discovery rate_. 226 | - A simple approach is to perform fewer tests by removing regions that 227 | could not possibly result in statistical significance, regardless of 228 | hypothesis under consideration. 229 | - Example: a region with 0 counts in all samples could not possibly be 230 | significant regardless of hypothesis, so exclude from further 231 | analysis. 
232 | - Basic approaches: 'K over A'-style filter -- require a minimum of A 233 | (normalized) read counts in at least K samples. Variance filter, 234 | e.g., IQR (inter-quartile range) provides a robust estimate of 235 | variability; can be used to rank and discard least-varying regions. 236 | - More nuanced approaches: `r Biocpkg("edgeR")` vignette; work flow 237 | today. 238 | 239 | Borrowing information 240 | 241 | - Why does low statistical power elevate false discovery rate? 242 | - One way of developing intuition is to recognize a t-test (for 243 | example) as a ratio of variances. The numerator is 244 | treatment-specific, but the denominator is a measure of overall 245 | variability. 246 | - Variances are measured with uncertainty; over- or under-estimating 247 | the denominator variance has an asymmetric effect on a t-statistic 248 | or similar ratio, with an underestimate _inflating_ the statistic 249 | more dramatically than an overestimate deflates the statistic. Hence 250 | elevated false discovery rate. 251 | - Under the typical null hypothesis used in microarray or RNA-seq 252 | experiments, each gene may respond differently to the treatment 253 | (numerator variance) but the overall variability of a gene is 254 | the same, at least for genes with similar average expression. 255 | - The strategy is to estimate the denominator variance as the 256 | within-group variance for the gene, _moderated_ by the average 257 | within-group variance across all genes. 258 | - This strategy exploits the fact that the same experimental design 259 | has been applied to all genes assayed, and is effective at 260 | moderating false discovery rate. 261 | 262 | ## 7. 
Comprehension 263 | 264 | Placing differentially expressed regions in context 265 | 266 | - Gene names associated with genomic ranges 267 | - Gene set enrichment and similar analysis 268 | - Proximity to regulatory marks 269 | - Integrate with other analyses, e.g., methylation, copy number, 270 | variants, ... 271 | 272 | ![Copy number / expression QC](our_figures/copy_number_QC_2.png) 273 | Correlation between genomic copy number and mRNA expression 274 | identified 38 mis-labeled samples in the TCGA ovarian cancer 275 | Affymetrix microarray dataset. 276 | 277 | # Experimental and statistical issues in depth 278 | 279 | ## Normalization 280 | 281 | `r Biocpkg("DESeq2")` `estimateSizeFactors()`, Anders and Huber, 282 | [2010](http://genomebiology.com/2010/11/10/r106) 283 | 284 | - For each gene: geometric mean of all samples. 285 | - For each sample: median ratio of the sample gene over the geometric 286 | mean of all samples 287 | - Functions other than the median can be used; control genes can be 288 | used instead 289 | 290 | `r Biocpkg("edgeR")` `calcNormFactors()` TMM method of Robinson and 291 | Oshlack, [2010](http://genomebiology.com/2010/11/3/r25) 292 | 293 | - Identify reference sample: library with upper quartile closest to 294 | the mean upper quartile of all libraries 295 | - Calculate M-value of each gene (log-fold change relative to reference) 296 | - Summarize library size as weighted trimmed mean of M-values. 297 | 298 | ## Dispersion 299 | 300 | `r Biocpkg("DESeq2")` `estimateDispersions()` 301 | 302 | - Estimate per-gene dispersion 303 | - Fit a smoothed relationship between dispersion and abundance 304 | 305 | `r Biocpkg("edgeR")` `estimateDisp()` 306 | 307 | - Common: single dispersion for all genes; appropriate for small 308 | experiments (<10? 
samples) 309 | - Tagwise: a separate dispersion for each gene; appropriate for larger 310 | / well-behaved experiments 311 | - Trended: bin based on abundance, estimate common dispersion within 312 | bin, fit a loess-smoothed relationship between binned dispersion and 313 | abundance 314 | 315 | # Analysis of designed experiments in R 316 | 317 | ## Example: t-test 318 | 319 | `t.test()` 320 | 321 | - `x`: vector of univariate measurements 322 | - `y`: `factor` describing experimental design 323 | - `var.equal=TRUE`: appropriate for relatively small experiments where 324 | no additional information available? 325 | - `formula`: alternative representation, `x ~ y` (measurements on the left, design factor on the right). 326 | 327 | ```{r sleep-t.test} 328 | head(sleep) 329 | plot(extra ~ group, data = sleep) 330 | ## Traditional interface 331 | with(sleep, t.test(extra[group == 1], extra[group == 2])) 332 | ## Formula interface 333 | t.test(extra ~ group, sleep) 334 | ## equal variance between groups 335 | t.test(extra ~ group, sleep, var.equal=TRUE) 336 | ``` 337 | 338 | `lm()` and `anova()` 339 | 340 | - `lm()`: fit _linear model_. 341 | - `anova()`: statistical evaluation. 342 | 343 | ```{r sleep-lm} 344 | ## linear model; compare to t.test(var.equal=TRUE) 345 | fit <- lm(extra ~ group, sleep) 346 | anova(fit) 347 | ``` 348 | 349 | - Under the hood: `formula`: translated into _model matrix_, used in 350 | `lm.fit()`.
351 | - With (implicit) intercept 1, last coefficient of model matrix 352 | reflects group effect 353 | - With intercept 0, _contrast_ between effects of coefficient 1 and 354 | coefficient 2 reflect group effect 355 | 356 | ```{r sleep-model.matrix} 357 | ## underlying model, used in `lm.fit()` 358 | model.matrix(extra ~ group, sleep) # last column indicates group effect 359 | model.matrix(extra ~ 0 + group, sleep) # contrast between columns 360 | ``` 361 | 362 | - Covariate -- fit base model containing only covariate, test 363 | improvement in fit when model includes factor of interest 364 | 365 | ```{r sleep-diff} 366 | fit0 <- lm(extra ~ ID, sleep) 367 | fit1 <- lm(extra ~ ID + group, sleep) 368 | anova(fit0, fit1) 369 | t.test(extra ~ group, sleep, var.equal=TRUE, paired=TRUE) 370 | ``` 371 | 372 | `genefilter::rowttests()` 373 | 374 | - t-tests for gene expression data 375 | - useful for exploratory analysis, but statistically sub-optimal 376 | - `x`: matrix of expression values 377 | - features x samples (reverse of how a 'statistician' would 378 | represent the data -- samples x features) 379 | 380 | - `fac`: factor of one or two levels describing experimental design 381 | 382 | Limitations 383 | 384 | - Assumes features are _independent_ 385 | - Ignores common experimental design 386 | - Ignores multiple testing 387 | 388 | Consequences 389 | 390 | - Poor estimate of between-group variance for each feature 391 | - Elevated false discovery rate 392 | 393 | ## Common experimental designs 394 | 395 | - t-test: `count ~ factor`. Alternative: `count ~ 0 + factor` and 396 | contrasts 397 | - covariates: `count ~ covariate + factor` 398 | - Single factor, multiple levels (one-way ANOVA) -- statistical 399 | contrasts: specify model as `count ~ factor` or `count ~ 0 + factor` 400 | - Factorial designs -- main effects, `count ~ factor1 + factor2`; main 401 | effects and interactions, `count ~ factor1 * factor2`. 
Contrasts to 402 | ask specific questions 403 | - Paired designs: include ID as covariate (approximate, since ID is a 404 | random effect); `r Biocpkg("limma")` approach: 405 | `duplicateCorrelation()` 406 | 407 | # Practical: RNA-Seq gene-level differential expression 408 | 409 | Adapted from Love, Anders, and Huber's Bioconductor 410 | [work flow](http://bioconductor.org/help/workflows/rnaseqGene/) 411 | 412 | Michael Love [1], Simon Anders [2], Wolfgang Huber [2] 413 | 414 | [1] Department of Biostatistics, Dana-Farber Cancer Institute and 415 | Harvard School of Public Health, Boston, US; 416 | 417 | [2] European Molecular Biology Laboratory (EMBL), Heidelberg, Germany. 418 | 419 | ## 1. Experimental design 420 | 421 | The data used in this workflow is an RNA-Seq experiment of airway 422 | smooth muscle cells treated with dexamethasone, a synthetic 423 | glucocorticoid steroid with anti-inflammatory effects. Glucocorticoids 424 | are used, for example, in asthma patients to prevent or reduce 425 | inflammation of the airways. In the experiment, four primary human 426 | airway smooth muscle cell lines were treated with 1 micromolar 427 | dexamethasone for 18 hours. For each of the four cell lines, we have a 428 | treated and an untreated sample. The reference for the experiment is: 429 | 430 | Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, 431 | Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, 432 | Tantisira KG, Weiss ST, Lu Q. "RNA-Seq Transcriptome Profiling 433 | Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates 434 | Cytokine Function in Airway Smooth Muscle Cells." PLoS One. 2014 Jun 435 | 13;9(6):e99625. 436 | PMID: [24926665](http://www.ncbi.nlm.nih.gov/pubmed/24926665). 437 | GEO: [GSE52778](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778). 
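The layout described above (four cell lines, each contributing one treated and one untreated sample) can be written down as a small sample table. The sketch below uses placeholder cell line labels (`line1` to `line4`) and assumed treatment labels (`untrt`, `trt`) rather than the identifiers used in the study; it only illustrates the design matrix that a `~ cell + dex` formula would produce.

```r
## Hypothetical sample table: 4 cell lines x 2 treatments = 8 samples.
## Labels are placeholders, not the identifiers from the airway study.
samples <- expand.grid(
    dex  = factor(c("untrt", "trt"), levels = c("untrt", "trt")),
    cell = factor(paste0("line", 1:4))
)

## Each cell line appears exactly once per treatment (a paired design)
design_counts <- table(samples$cell, samples$dex)

## Model matrix: intercept, 3 cell-line offsets, 1 treatment effect
design <- model.matrix(~ cell + dex, samples)
dim(design)
colnames(design)
```

The last column of the model matrix carries the treatment effect after adjusting for cell line, which is the comparison of interest in a paired layout like this one.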
438 | 439 | ## 2, 3, and 4: Wet lab, sequencing, and alignment 440 | 441 | - Paired-end sequencing leading to 442 | [FASTQ](http://en.wikipedia.org/wiki/FASTQ_format) files of reads 443 | and their quality scores. 444 | 445 | - Reads aligned to a reference genome or transcriptome, resulting in 446 | [BAM](http://samtools.github.io/hts-specs) files. Reads for this 447 | experiment were aligned to the Ensembl release 75 human reference 448 | genome using the [STAR](https://code.google.com/p/rna-star/) aligner. 449 | 450 | ## 5. Reduction 451 | 452 | We use the `r Biocexptpkg("airway")` package to illustrate 453 | reduction. The package provides sample information, a subset of eight 454 | BAM files, and the known gene models required to count the reads. 455 | 456 | ```{r airway-bam-path} 457 | library(airway) 458 | path <- system.file(package="airway", "extdata") 459 | dir(path) 460 | ``` 461 | 462 | ### Setup 463 | 464 | The ingredients for counting are: 465 | 466 | a. Metadata describing samples. Read this using `read.csv()`. 467 | 468 | ```{r airway-csv} 469 | csvfile <- dir(path, "sample_table.csv", full.names=TRUE) 470 | sampleTable <- read.csv(csvfile, row.names=1) 471 | head(sampleTable) 472 | ``` 473 | 474 | b. BAM files containing aligned reads. Create an object that 475 | references these files. What does the `yieldSize` argument mean? 476 | 477 | ```{r airway-bam} 478 | library(Rsamtools) 479 | filenames <- dir(path, "\\.bam$", full.names=TRUE) 480 | bamfiles <- BamFileList(filenames, yieldSize=1000000) 481 | names(bamfiles) <- sub("_subset\\.bam$", "", basename(filenames)) 482 | ``` 483 | 484 | c. Known gene models. These might come from an existing `TxDb` 485 | package, or be created from biomart or UCSC, or from a 486 | [GTF file](http://www.ensembl.org/info/website/upload/gff.html). We'll 487 | take the hard road, making a TxDb object from the GTF file used to 488 | align reads and using the TxDb to get all exons, grouped by gene.
489 | 490 | ```{r airway-gtf-to-txdb} 491 | library(GenomicFeatures) 492 | gtffile <- file.path(path, "Homo_sapiens.GRCh37.75_subset.gtf") 493 | txdb <- makeTxDbFromGFF(gtffile, format="gtf", circ_seqs=character()) 494 | genes <- exonsBy(txdb, by="gene") 495 | ``` 496 | 497 | ### Counting 498 | 499 | After these preparations, the actual counting is easy. The function 500 | `summarizeOverlaps()` from the `r Biocpkg("GenomicAlignments")` 501 | package will do this. This produces a `SummarizedExperiment` object, 502 | which contains a variety of information about an experiment 503 | 504 | ```{r} 505 | library(GenomicAlignments) 506 | se <- summarizeOverlaps(features=genes, reads=bamfiles, 507 | mode="Union", 508 | singleEnd=FALSE, 509 | ignore.strand=TRUE, 510 | fragments=TRUE) 511 | colData(se) <- as(sampleTable, "DataFrame") 512 | se 513 | colData(se) 514 | rowData(se) 515 | head(assay(se)) 516 | ``` 517 | 518 | ## 6. Analysis using `r Biocpkg("DESeq2")` 519 | 520 | The previous section illustrates the reduction step on a subset of the 521 | data; here's the full data set 522 | 523 | ```{r airway-data} 524 | data(airway) 525 | se <- airway 526 | ``` 527 | 528 | This object contains an informative `colData` slot -- prepared as 529 | described in the `r Biocexptpkg("airway")` vignette. 
In particular, 530 | the `colData()` include columns describing the cell line `cell` and 531 | treatment `dex` for each sample 532 | 533 | ```{r airway-cell-dex} 534 | colData(se) 535 | ``` 536 | 537 | `r Biocpkg("DESeq2")` makes the analysis particularly easy, simply add 538 | the experimental design, run the pipeline, and extract the results 539 | 540 | ```{r airway-DESeq2-design} 541 | library(DESeq2) 542 | dds <- DESeqDataSet(se, design = ~ cell + dex) 543 | dds <- DESeq(dds) 544 | res <- results(dds) 545 | ``` 546 | 547 | Simple visualizations / sanity checks include 548 | 549 | - Look at counts of strongly differentiated genes, to get a sense of 550 | how counts translate to the summary statistics reported in the 551 | result table 552 | 553 | ```{r plotcounts, fig.width=5, fig.height=5} 554 | topGene <- rownames(res)[which.min(res$padj)] 555 | res[topGene,] 556 | plotCounts(dds, gene=topGene, intgroup=c("dex")) 557 | ``` 558 | 559 | - An 'MA' plot shows for each gene the between-group log-fold-change 560 | versus average log count; it should be funnel-shaped and 561 | approximately symmetric around `y=0`, with lots of between-treatment 562 | variation for genes with low counts. 563 | 564 | ```{r plotma} 565 | plotMA(res, ylim=c(-5,5)) 566 | ``` 567 | 568 | - Plot the distribution of (unadjusted) P values, which should be 569 | uniform (under the null) but with a peak at small P value (true 570 | positives, hopefully!) 571 | 572 | ```{r airway-DESeq2-hist} 573 | hist(res$pvalue, breaks=50) 574 | ``` 575 | 576 | - Look at a 'volcano plot' of adjusted P-value versus log fold change, 577 | to get a sense of the fraction of up- versus down-regulated genes 578 | 579 | ```{r airway-DESeq2-volcano} 580 | plot(-log10(padj) ~ log2FoldChange, as.data.frame(res), pch=20) 581 | ``` 582 | 583 | Many additional diagnostic approaches are described in the DESeq2 (and 584 | edgeR) vignettes, and in the RNA-seq gene differential expression work 585 | flow. 586 | 587 | ## 7. 
Comprehension 588 | 589 | see Part E, Gene Set Enrichment 590 | 591 | -------------------------------------------------------------------------------- /vignettes/D_MachineLearning.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: D. Machine Learning 3 | author: 4 | Martin Morgan (mtmorgan@fredhutch.org)
5 | Sonali Arora (sarora@fredhutch.org) 6 | output: 7 | BiocStyle::html_document: 8 | toc: true 9 | vignette: > 10 | %\VignetteIndexEntry{D. Machine Learning} 11 | %\VignetteEngine{knitr::rmarkdown} 12 | \usepackage[utf8]{inputenc} 13 | --- 14 | 15 | ```{r style, echo=FALSE, results='asis'} 16 | BiocStyle::markdown() 17 | ``` 18 | 19 | ## Introduction to Machine Learning 20 | 21 | Let's say that we are interested in predicting the run time of an athlete 22 | from shoe size, height and weight in a study of 100 people. 23 | We can do so using a simple linear regression model where 24 | 25 | ```{r eval=FALSE} 26 | y = beta0 + beta1 * height + beta2 * weight + beta3 * shoe_size 27 | ``` 28 | 29 | Here y is the response variable (run time), n is the number of observations 30 | (100 people), p is the number of variables / features / predictors (3, i.e., height, 31 | weight, shoe size), and X is an n x p matrix. 32 | 33 | This is a low-dimensional data set, where n >> p, but most of the 34 | data sets coming out of modern biological techniques are high dimensional, i.e., 35 | n << p. This poses a statistical challenge, and simple linear regression can no 36 | longer help us. 37 | 38 | For example, 39 | 40 | * Identify the risk factors (genes) for prostate cancer based on gene 41 | expression data 42 | * Predict the chances of breast cancer survival in a patient. 43 | * Identify patterns of gene expression among different subtypes of 44 | breast cancer 45 | 46 | In all three examples listed above, n, the number of observations, is 30-40 patients 47 | whereas p, the number of features, is approximately 30,000 genes. Try writing a linear 48 | regression formula for the outcome variable, y, in any of the above three 49 | scenarios.
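To see concretely why ordinary linear regression breaks down when n << p, the sketch below fits `lm()` to simulated data with 10 observations and 20 predictors. The data are random noise and the dimensions are arbitrary choices for illustration; nothing here comes from a real experiment.

```r
set.seed(1)
n <- 10   # observations (e.g., patients)
p <- 20   # predictors (e.g., genes); p > n

## Pure noise: there is no true relationship between X and y
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("gene", seq_len(p))))
y <- rnorm(n)

fit <- lm(y ~ X)

## Only n coefficients can be estimated; the rest are reported as NA
n_estimated <- sum(!is.na(coef(fit)))
n_dropped   <- sum(is.na(coef(fit)))

## The model nevertheless reproduces the noise exactly (residuals ~ 0)
max_abs_resid <- max(abs(residuals(fit)))
```

A perfect fit to pure noise is the over-fitting problem in its most extreme form: the model says nothing about how new observations would behave.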
50 | 51 | Listed below are things that can go wrong with high dimensional data 52 | - some of these predictors are useful, some are not 53 | - if we include too many predictors, we can over fit the data 54 | 55 | This is why we need Machine Learning. Let's first introduce some basic concepts 56 | and then dive into examples and a lab session. 57 | 58 | **Supervised Learning** - Use a data set X to predict the association with a 59 | response variable Y. The response variable can be continuous or categorical. 60 | For example: Predicting the chances of breast cancer survival in a patient. 61 | 62 | **Unsupervised Learning** - Discover the associations or patterns in X. No 63 | response variable is present. For example: Cluster similar genes into groups. 64 | 65 | **Training & Test Datasets** - Usually we split the observations into test and 66 | training data sets. We fit the model on the training data set and evaluate on the 67 | test data set. The test set error rate is an estimate of the model's performance 68 | on future data sets. 69 | 70 | **Model Selection** - We usually consider numerous models for a given problem. 71 | For example, if we are trying to identify the genes responsible for a given 72 | disease using a gene expression data set, we could have the following models 73 | a) model 1 - use all 30000 genes from the array to build a model 74 | b) model 2 - include only genes related to the pathway that we know is 75 | upregulated in that disease to build a model 76 | c) model 3 - include genes found in the literature which are known to influence 77 | this disease 78 | It is highly recommended to use the test set only on our final model to see 79 | how our model will do with new, unseen data. So how do we pick the best 80 | model which can be tested on the test data set? 81 | 82 | **Cross-validation** 83 | We can use different approaches to find the best model.
Let's look at the 84 | commonly used approaches, namely, validation set, leave one out 85 | cross-validation, k-fold cross validation. 86 | 87 | Briefly, the __validation set approach__ deals with dividing the full data set into 88 | 3 groups - training set, validation set and the test set. We train the models on 89 | the training set, evaluate their performance on the validation set and then the 90 | best model is chosen to fit on the test set. 91 | 92 | The __leave one out cross validation__ starts with fitting n models (where n is 93 | number of observations in the training data set), each on n-1 observations, 94 | evaluating each model on the left-out observation. The best model is the one 95 | for which the total test error is the smallest and that is then used to predict 96 | the test set. 97 | 98 | Lastly the __5 fold cross validation__ (here k=5), is splitting the training 99 | data set into 5 sets and repeatedly training the model on the other 4 sets and 100 | evaluating the performance on the fifth. 101 | 102 | **Bias, Variance, Overfitting** - Bias refers to the average difference between 103 | the actual betas and the estimated betas; variance refers to the amount by 104 | which the estimated betas differ across experiments. As the model complexity (number of 105 | variables) increases, the bias decreases and the variance increases. This is 106 | known as the Bias-Variance Tradeoff, and a model that has too much variance 107 | is said to be over fit. 108 | 109 | ## Datasets 110 | 111 | For **Unsupervised learning**, we will use RNA-Seq count data from the 112 | Bioconductor package, `r Biocpkg("airway")`.
From the abstract, a brief 113 | description of the RNA-Seq experiment on airway smooth muscle (ASM) cell 114 | lines: “Using RNA-Seq, a high-throughput sequencing method, we characterized 115 | transcriptomic changes in four primary human ASM cell lines that were treated 116 | with dexamethasone - a potent synthetic glucocorticoid (1 micromolar for 117 | 18 hours).” 118 | 119 | ```{r message=FALSE} 120 | library(airway) 121 | data("airway") 122 | se <- airway 123 | colData(se) 124 | library("DESeq2") 125 | dds <- DESeqDataSet(se, design = ~ cell + dex) 126 | ``` 127 | 128 | For **Supervised learning**, we will use cervical count data from the 129 | Bioconductor package, `r Biocpkg("MLSeq")`. This data set contains 130 | expression values of 714 miRNAs in human samples. There are 29 tumor and 29 131 | non-tumor cervical samples. For learning purposes, we can treat these 132 | as two separate groups and run various classification algorithms. 133 | 134 | ```{r message=FALSE} 135 | library(MLSeq) 136 | filepath = system.file("extdata/cervical.txt", package = "MLSeq") 137 | cervical = read.table(filepath, header = TRUE) 138 | ``` 139 | 140 | 141 | ## Unsupervised Learning 142 | 143 | Unsupervised Learning is a set of statistical tools intended for the setting 144 | in which we have only a set of 'p' features measured on 'n' observations. 145 | We are primarily interested in discovering interesting 146 | things about the 'p' features. 147 | 148 | Unsupervised Learning is often performed as a part of Exploratory Data Analysis. 149 | These tools help us to get a good idea about the data set. Unlike a supervised 150 | learning problem, where we can use prediction to gain some confidence about our 151 | learning algorithm, there is no way to check our model. The learning algorithm 152 | is thus aptly named "unsupervised".
153 | 154 | **RLOG TRANSFORMATION** 155 | 156 | Many common statistical methods for exploratory analysis of multidimensional 157 | data, especially methods for clustering and ordination (e.g., 158 | principal-component analysis and the like), work best for (at least 159 | approximately) homoskedastic data; this means that the variance of an observed 160 | quantity (here, the expression strength of a gene) does not depend on the mean. 161 | 162 | In RNA-Seq data, the variance grows with the mean. If one performs PCA 163 | (principal components analysis) directly on a matrix of normalized read counts, 164 | the result typically depends only on the few most strongly expressed genes 165 | because they show the largest absolute differences between samples. 166 | 167 | As a solution, DESeq2 offers the regularized-logarithm transformation, or rlog 168 | for short. See the help for ?rlog for more information and options. 169 | 170 | The function rlog returns a SummarizedExperiment object which contains the 171 | rlog-transformed values in its assay slot: 172 | 173 | ```{r} 174 | rld <- rlog(dds) 175 | head(assay(rld)) 176 | ``` 177 | 178 | To assess overall similarity between samples: Which samples are similar to each 179 | other, which are different? Does this fit the expectation from the 180 | experiment's design? We use the R function dist to calculate the Euclidean 181 | distance between samples. To keep the distance measure from being dominated by 182 | a few highly variable genes, and to have a roughly equal contribution from all 183 | genes, we use it on the rlog-transformed data 184 | 185 | ```{r} 186 | sampleDists <- dist( t( assay(rld) ) ) 187 | sampleDists 188 | ``` 189 | Note the use of the function t to transpose the data matrix. We need this 190 | because dist calculates distances between data rows and our samples constitute 191 | the columns.
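The effect of the transpose is easy to check on a tiny matrix. The sketch below uses a made-up 4 gene x 3 sample matrix (the values and names are invented for illustration) and compares `dist()` with and without `t()`.

```r
## Toy expression matrix: rows are genes, columns are samples
m <- matrix(c(1, 2, 3,
              1, 2, 6,
              5, 5, 5,
              0, 1, 4),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))

d_genes   <- dist(m)      # distances between rows, i.e., between genes
d_samples <- dist(t(m))   # distances between columns, i.e., between samples

## Number of objects compared in each case
attr(d_genes, "Size")
attr(d_samples, "Size")

## Check one entry by hand: Euclidean distance between sample1 and sample2
by_hand <- sqrt(sum((m[, "sample1"] - m[, "sample2"])^2))
from_dist <- as.matrix(d_samples)["sample1", "sample2"]
```

Without the transpose, the result would describe how similar genes are to each other, which is a different question from how similar the samples are.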
192 | 193 | **HEATMAP** 194 | 195 | We visualize the sample-to-sample distances in a heatmap, using the 196 | function heatmap.2 from the gplots package. Note that we have changed the row 197 | names of the distance matrix to contain treatment type and patient number 198 | instead of sample ID, so that we have all this information in view when 199 | looking at the heatmap. 200 | 201 | ```{r message=FALSE} 202 | library("gplots") 203 | library("RColorBrewer") 204 | 205 | sampleDistMatrix <- as.matrix( sampleDists ) 206 | rownames(sampleDistMatrix) <- paste( rld$dex, rld$cell, sep="-" ) 207 | colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255) 208 | hc <- hclust(sampleDists) 209 | heatmap.2( sampleDistMatrix, Rowv=as.dendrogram(hc), 210 | symm=TRUE, trace="none", col=colors, 211 | margins=c(2,10), labCol=FALSE ) 212 | 213 | ``` 214 | 215 | **PCA** 216 | 217 | Another way to visualize sample-to-sample distances is a principal-components 218 | analysis (PCA). In this ordination method, the data points (i.e., here, the 219 | samples) are projected onto the 2D plane such that they spread out in the two 220 | directions which explain most of the differences in the data. The x-axis is 221 | the direction (or principal component) which separates the data points the most. 222 | The amount of the total variance which is contained in the direction is 223 | printed in the axis label. Here, we have used the function plotPCA which comes 224 | with DESeq2. The two terms specified by intgroup are the interesting groups 225 | for labelling the samples; they tell the function to use them to choose colors. 226 | 227 | ```{r} 228 | plotPCA(rld, intgroup = c("dex", "cell")) 229 | ``` 230 | 231 | From both visualizations, we see that the differences between cells are 232 | considerable, though not stronger than the differences due to treatment 233 | with dexamethasone. 
This shows why it will be important to account for this 234 | in differential testing by using a paired design (“paired”, because each dex 235 | treated sample is paired with one untreated sample from the same cell line). 236 | We are already set up for this by using the design formula ~ cell + dex when 237 | setting up the data object in the beginning. 238 | 239 | **MDS** 240 | Another plot, very similar to the PCA plot, can be made using the 241 | multidimensional scaling (MDS) function in base R. This is useful when we don't 242 | have the original data, but only a matrix of distances. Here we have the MDS 243 | plot for the distances calculated from the rlog transformed counts: 244 | 245 | ```{r} 246 | library(ggplot2) 247 | mds <- data.frame(cmdscale(sampleDistMatrix)) 248 | mds <- cbind(mds, colData(rld)) 249 | qplot(X1,X2,color=dex,shape=cell,data=as.data.frame(mds)) 250 | ``` 251 | 252 | ### Exercise: 253 | Use the plotMDS function from the limma package to make a similar plot. 254 | What is the advantage of using this function over base R's cmdscale? 255 | 256 | **Solutions:** 257 | 258 | A similar plot can be made using the plotMDS() function in limma where the input 259 | is a matrix of log-expression values. Here the advantage is that the 260 | distances on the plot approximate typical log2 fold changes between samples, and not only is the plot 261 | created, but the object (with distance matrix) is also returned. 262 | 263 | ```{r plotMDS} 264 | suppressPackageStartupMessages({ 265 | library(limma) 266 | library(DESeq2) 267 | library(airway) 268 | }) 269 | plotMDS(assay(rld), col=as.integer(dds$dex), pch=as.integer(dds$cell)) 270 | ``` 271 | 272 | 273 | 274 | ## Supervised Learning 275 | 276 | In supervised learning, along with the 'p' features, we 277 | also have a response Y measured on the same n observations. The goal is then 278 | to predict Y using X (n x p matrix) for new observations.
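Before turning to real data, the fit-then-predict pattern can be sketched on simulated data. The example below uses k-nearest neighbours from the `class` package (a recommended package distributed with R) purely as a stand-in classifier; the simulated features, the 20% hold-out fraction, and `k = 3` are assumptions chosen for illustration.

```r
library(class)   # provides knn(); a recommended package shipped with R

set.seed(123)
n_per_class <- 30

## Simulated n x p feature matrix (p = 2); class means differ by 4
X <- rbind(matrix(rnorm(n_per_class * 2, mean = 0), ncol = 2),
           matrix(rnorm(n_per_class * 2, mean = 4), ncol = 2))
Y <- factor(rep(c("A", "B"), each = n_per_class))

## Hold out 20% of the observations to play the role of 'new' data
test_idx <- sample(nrow(X), size = ceiling(0.2 * nrow(X)))

## Learn from the training observations, predict the held-out ones
pred <- knn(train = X[-test_idx, ], test = X[test_idx, ],
            cl = Y[-test_idx], k = 3)

## Compare predictions with the known labels of the held-out observations
confusion <- table(predicted = pred, actual = Y[test_idx])
accuracy  <- mean(pred == Y[test_idx])
```

Because the held-out labels are known, the quality of the predictions can be measured directly, which is exactly what is missing in the unsupervised setting.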
279 | 280 | For the cervical data, we know that the first 29 are non-Tumor samples 281 | whereas the last 29 are Tumor samples. We will code these as 0 and 1 282 | respectively. We will randomly sample 20% of our data and use that as a 283 | test set. The remaining 80% of the data will be used as training data 284 | 285 | ```{r } 286 | set.seed(9) 287 | 288 | class = data.frame(condition = factor(rep(c(0, 1), c(29, 29)))) 289 | 290 | nTest = ceiling(ncol(cervical) * 0.2) 291 | ind = sample(ncol(cervical), nTest, FALSE) 292 | 293 | cervical.train = cervical[, -ind] 294 | cervical.train = as.matrix(cervical.train + 1) 295 | classtr = data.frame(condition = class[-ind, ]) 296 | 297 | cervical.test = cervical[, ind] 298 | cervical.test = as.matrix(cervical.test + 1) 299 | classts = data.frame(condition = class[ind, ]) 300 | ``` 301 | 302 | MLSeq aims to make computation less complicated for a user and 303 | allows one to learn a model using various classifiers with a single function. 304 | 305 | The main function of this package is classify, which requires data in the form of 306 | a DESeqDataSet instance. The DESeqDataSet is a subclass of SummarizedExperiment, 307 | used to store the input values, intermediate calculations and results of an 308 | analysis of differential expression. 309 | 310 | So let's create a DESeqDataSet object for both the training and test set, and run 311 | DESeq on each. 312 | 313 | ```{r} 314 | cervical.trainS4 = DESeqDataSetFromMatrix(countData = cervical.train, 315 | colData = classtr, formula(~condition)) 316 | cervical.trainS4 = DESeq(cervical.trainS4, fitType = "local") 317 | 318 | cervical.testS4 = DESeqDataSetFromMatrix(countData = cervical.test, colData = classts, 319 | formula(~condition)) 320 | cervical.testS4 = DESeq(cervical.testS4, fitType = "local") 321 | 322 | ``` 323 | Classify using Support Vector Machines.
324 | 325 | ```{r} 326 | svm = classify(data = cervical.trainS4, method = "svm", normalize = "deseq", 327 | deseqTransform = "vst", cv = 5, rpt = 3, ref = "1") 328 | svm 329 | ``` 330 | 331 | It returns an object of class 'MLSeq' and we observe that it successfully 332 | fitted a model with 97.8% accuracy. We can list the slots of this S4 class with 333 | ```{r} 334 | getSlots("MLSeq") 335 | ``` 336 | We can also ask about the trained model. 337 | 338 | ```{r} 339 | trained(svm) 340 | ``` 341 | 342 | We can predict the class labels of our test data using `predictClassify()` 343 | 344 | ```{r} 345 | pred.svm = predictClassify(svm, cervical.testS4) 346 | table(pred.svm, relevel(cervical.testS4$condition, 2)) 347 | ``` 348 | 349 | The other classification methods available are 'randomforest', 'cart' and 350 | 'bagsvm'. 351 | 352 | ### Exercise: 353 | 354 | Train a random forest classifier on the same training data and evaluate it on the same test data. 355 | 356 | **Solutions:** 357 | 358 | ```{r} 359 | rf = classify(data = cervical.trainS4, method = "randomforest", 360 | normalize = "deseq", deseqTransform = "vst", cv = 5, rpt = 3, ref = "1") 361 | trained(rf) 362 | pred.rf = predictClassify(rf, cervical.testS4) 363 | table(pred.rf, relevel(cervical.testS4$condition, 2)) 364 | ``` 365 | 366 | ## SessionInfo 367 | 368 | ```{r} 369 | sessionInfo() 370 | ``` 371 | 372 | ## References 373 | 374 | 1. Zararsiz G, Goksuluk D, Korkmaz S, Eldem V, Duru IP, Unver T and Ozturk A (2014). MLSeq: Machine learning interface for RNA-Seq data. R package version 1.3.0. 375 | 2. Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST and Lu Q (2014). “RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS ONE, 9(6), pp. e99625. http://www.ncbi.nlm.nih.gov/pubmed/24926665. 376 | 3.
An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani 377 | 4. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, Jerome Friedman 378 | 379 | -------------------------------------------------------------------------------- /vignettes/E_GeneSetEnrichment.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: E. Gene Set Enrichment 3 | author: Martin Morgan (mtmorgan@fredhutch.org) 4 | date: "`r Sys.Date()`" 5 | output: 6 | BiocStyle::html_document: 7 | toc: true 8 | vignette: > 9 | %\VignetteIndexEntry{E. Gene Set Enrichment} 10 | %\VignetteEngine{knitr::rmarkdown} 11 | \usepackage[utf8]{inputenc} 12 | --- 13 | 14 | ```{r style, echo=FALSE, results='asis'} 15 | BiocStyle::markdown() 16 | suppressPackageStartupMessages({ 17 | library(edgeR) 18 | library(goseq) 19 | library(org.Hs.eg.db) 20 | library(GO.db) 21 | }) 22 | ``` 23 | 24 | # Motivation 25 | 26 | Is expression of genes in a gene set associated with experimental 27 | condition? 28 | 29 | - E.g., Are there unusually many up-regulated genes in the gene set? 30 | 31 | Many methods; a recent review is Khatri et al., 2012. 32 | 33 | - Over-representation analysis (ORA) -- are differentially expressed 34 | (DE) genes in the set more common than expected? 35 | - Functional class scoring (FCS) -- summarize statistic of DE of genes 36 | in a set, and compare to null 37 | - Pathway topology (PT) -- include pathway knowledge in assessing DE 38 | of genes in a set 39 | 40 | ## What is a gene set? 41 | 42 | **Any** _a priori_ classification of 'genes' into biologically 43 | relevant groups 44 | 45 | - Members of same biochemical pathway 46 | - Proteins expressed in identical cellular compartments 47 | - Co-expressed under certain conditions 48 | - Targets of the same regulatory elements 49 | - On the same cytogenetic band 50 | - ...
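One way to make this definition concrete is a named list in which each element holds the genes belonging to one set. The set names and gene identifiers below are made up for illustration, as is the `universe` of assayed genes; the sketch also shows that sets may share genes and need not cover every gene.

```r
## Hypothetical gene sets; set and gene names are invented
gene_sets <- list(
    pathway_A = c("gene1", "gene2", "gene3"),
    membrane  = c("gene3", "gene4"),
    chr1_band = c("gene5")
)
universe <- paste0("gene", 1:8)   # all genes assayed (assumed)

## Sets may overlap ...
shared <- intersect(gene_sets$pathway_A, gene_sets$membrane)

## ... and some genes may belong to no set at all
unassigned <- setdiff(universe, unlist(gene_sets))

## The same information as a table of gene / set pairs
pairs <- stack(gene_sets)
names(pairs) <- c("gene", "set")
pairs
```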
51 | 52 | Sets do not need to be... 53 | 54 | - Exhaustive 55 | - Disjoint 56 | 57 | ## Collections of gene sets 58 | 59 | Gene Ontology ([GO](http://geneontology.org)) Annotation (GOA) 60 | 61 | - CC Cellular Components 62 | - BP Biological Processes 63 | - MF Molecular Function 64 | 65 | Pathways 66 | 67 | - [MSigDb](http://www.broadinstitute.org/gsea/msigdb/) 68 | - [KEGG](http://genome.jp/kegg) (no longer freely available) 69 | - [reactome](http://reactome.org) 70 | - [PantherDB](http://pantherdb.org) 71 | - ... 72 | 73 | E.g., [MSigDb](http://www.broadinstitute.org/gsea/msigdb/) 74 | 75 | - c1 Positional gene sets -- chromosome \& cytogenetic band 76 | - c2 Curated Gene Sets from online pathway databases, 77 | publications in PubMed, and knowledge of domain experts. 78 | - c3 motif gene sets based on conserved cis-regulatory motifs 79 | from a comparative analysis of the human, mouse, rat, and dog 80 | genomes. 81 | - c4 computational gene sets defined by mining large collections 82 | of cancer-oriented microarray data. 83 | - c5 GO gene sets consist of genes annotated by the same GO 84 | terms. 85 | - c6 oncogenic signatures defined directly from microarray gene 86 | expression data from cancer gene perturbations. 87 | - c7 immunologic signatures defined directly from microarray 88 | gene expression data from immunologic studies. 89 | 90 | # Statistical approaches 91 | 92 | Initially based on a presentation by Simon Anders, 93 | [CSAMA 2010](http://marray.economia.unimi.it/2009/material/lectures/L8_Gene_Set_Testing.pdf) 94 | 95 | ## Approach 1: hypergeometric tests 96 | 97 | Steps 98 | 99 | 1. Classify each gene as 'differentially expressed' DE or not, e.g., 100 | based on _P_ < 0.05 101 | 2. Are DE genes in the set more common than DE genes not in the set? 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 |
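| Differentially expressed? | In gene set: Yes | In gene set: No |
|---------------------------|------------------|-----------------|
| Yes                       | k                | K               |
| No                        | n - k            | N - K           |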
132 | 133 | 3. Fisher hypergeometric test, via `fisher.test()` or `r Biocpkg("GOstats")` 134 | 135 | Notes 136 | 137 | - Conditional hypergeometric to accommodate GO DAG, `r Biocpkg("GOstats")` 138 | - But: artificial division into two groups 139 | 140 | ## Approach 2: enrichment score 141 | 142 | - Mootha et al., 2003; modified Subramanian et al., 2005. 143 | 144 | Steps 145 | 146 | - Sort genes by log fold change 147 | - Calculate running sum: incremented when gene in set, decremented when not. 148 | - Maximum of the running sum is enrichment score ES; large ES means 149 | that genes in set are toward top of list. 150 | - Permuting subject labels for significance 151 | 152 | 153 | 154 | ## Approach 3: category $t$-test 155 | 156 | E.g., Jiang \& Gentleman, 2007; `r Biocpkg("Category")` 157 | 158 | - Summarize $t$ (or other) statistic in each set 159 | - Test for significance by permuting the subject labels 160 | - Much more straight-forward to implement 161 | 162 | Expression in NEG vs BCR/ABL samples for genes in the 'ribosome' KEGG 163 | pathway; `r Biocpkg("Category")` vignette. 164 | 165 | 166 | 167 | 168 | ## Competitive versus self-contained null hypothesis 169 | 170 | Goeman & Bühlmann, 2007 171 | 172 | - Competitive null: The genes in the gene set do not have stronger 173 | association with the subject condition than other genes. (Approach 174 | 1, 2) 175 | - Self-contained null: The genes in the gene set do not have any 176 | association with the subject condition. (Approach 3) 177 | - Probably, self-contained null is closer to actual question of interest 178 | - Permuting subjects (rather than genes) is appropriate 179 | 180 | ## Approach 4: linear models 181 | 182 | E.g., Hummel et al., 2008, `r Biocpkg("GlobalAncova")` 183 | 184 | - Colorectal tumors have good ('stage II') or bad ('stage III') 185 | prognosis. Do genes in the p53 pathway (_just one gene set!_) show 186 | different activity at the two stages?
187 | - Linear model incorporates covariates -- sex of patient, location of tumor 188 | 189 | `r Biocpkg("limma")` 190 | 191 | - Majewski et al., 2010 `romer()` and Wu & Smyth 2012 `camera()` for 192 | enrichment (competitive null) linear models 193 | - Wu et al., 2010: `roast()`, `mroast()` for self-contained null 194 | linear models 195 | 196 | ## Approach 5: pathway topology 197 | 198 | E.g., Tarca et al., 2009, `r Biocpkg("SPIA")` 199 | 200 | - Incorporate pathway topology (e.g., interactions between gene products) into significance testing 201 | 202 | - Signaling Pathway Impact Analysis 203 | 204 | - Combined evidence: pathway over-representation $P_{NDE}$; unusual 205 | signaling $P_{PERT}$ (equation 1 of Tarca et al.) 206 | 207 | Evidence plot, colorectal cancer. Points: pathway gene sets. 208 | Significant after Bonferroni (red) or FDR (blue) correction. 209 | 210 | 211 | 212 | ## Issues with sequence data? 213 | 214 | - All else being equal, long genes receive more reads than short genes 215 | - Per-gene $P$ values therefore depend on gene size (longer genes have more power to be called differentially expressed) 216 | 217 | E.g., Young et al., 2010, `r Biocpkg("goseq")` 218 | 219 | - Hypergeometric, weighted by gene size 220 | - Substantial differences 221 | - Better: read depth?? 222 | 223 | DE genes vs. transcript length. Points: bins of 300 genes. Line: 224 | fitted probability weighting function. 225 | 226 | 227 | 228 | ## Approach 6: _de novo_ discovery 229 | 230 | - So far: analogous to supervised machine learning, where pathways are 231 | known in advance 232 | - What about unsupervised discovery?
233 | 234 | Example: Langfelder & Horvath, 235 | [WGCNA](http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/) 236 | 237 | - Weighted correlation network analysis 238 | - Described in Langfelder & Horvath, 239 | [2008](http://www.biomedcentral.com/1471-2105/9/559) 240 | 241 | ## Representing gene sets in R 242 | 243 | - Named `list()`, where names of the list are sets, and each element 244 | of the list is a vector of genes in the set. 245 | - `data.frame()` of set name / gene name pairs 246 | - `r Biocpkg("GSEABase")` 247 | 248 | ## Conclusions 249 | 250 | Gene set enrichment classifications 251 | 252 | - Khatri et al: Over-representation analysis; functional class 253 | scoring; pathway topology 254 | - Goeman & Bühlmann: Competitive vs. self-contained null 255 | 256 | Selected _Bioconductor_ Packages 257 | 258 | | Approach | Packages | 259 | |-------------------|---------------------------------------------| 260 | | Hypergeometric | `r Biocpkg("GOstats")`, `r Biocpkg("topGO")`| 261 | | Enrichment | `r Biocpkg("limma")``::romer()` | 262 | | Category $t$-test | `r Biocpkg("Category")` | 263 | | Linear model | `r Biocpkg("GlobalAncova")`, `r Biocpkg("GSEAlm")`, `r Biocpkg("limma")``::roast()` | 264 | | Pathway topology | `r Biocpkg("SPIA")` | 265 | | Sequence-specific | `r Biocpkg("goseq")` | 266 | | _de novo_ | `r CRANpkg("WGCNA")` | 267 | 268 | # Practical 269 | 270 | This practical is based on section 6 of the `r Biocpkg("goseq")` 271 | [vignette](http://bioconductor.org/packages/devel/bioc/vignettes/goseq/inst/doc/goseq.pdf). 272 | 273 | ## 1-6 Experimental design, ..., Analysis of gene differential expression 274 | 275 | This (relatively old) experiment examined the effects of androgen 276 | stimulation on a human prostate cancer cell line, LNCaP (Li et al., 277 | [2008](https://doi.org/10.1073/pnas.0807121105)). The experiment 278 | used short (35bp) single-end reads from 4 control and 3 treated 279 | lines.
Reads were aligned to hg19 using Bowtie, and counted using 280 | ENSEMBL 54 gene models. 281 | 282 | Input the data to `r Biocpkg("edgeR")`'s `DGEList` data structure. 283 | 284 | ```{r prostate-edgeR-input} 285 | library(edgeR) 286 | path <- system.file(package="goseq", "extdata", "Li_sum.txt") 287 | 288 | table.summary <- read.table(path, sep='\t', header=TRUE, stringsAsFactors=FALSE) 289 | counts <- table.summary[,-1] 290 | rownames(counts) <- table.summary[,1] 291 | grp <- factor(rep(c("Control","Treated"), times=c(4,3))) 292 | summarized <- DGEList(counts, lib.size=colSums(counts), group=grp) 293 | ``` 294 | 295 | Use a 'common' dispersion estimate, and compare the two groups using 296 | an exact test. 297 | 298 | ```{r prostate-edgeR-de} 299 | disp <- estimateCommonDisp(summarized) 300 | tested <- exactTest(disp) 301 | topTags(tested) 302 | ``` 303 | 304 | ## 7. Comprehension 305 | 306 | Start by extracting all P values, then correcting for multiple 307 | comparisons using `p.adjust()`. Classify the genes as differentially 308 | expressed or not. 309 | 310 | ```{r prostate-edgeR-padj} 311 | padj <- with(tested$table, { 312 | keep <- logFC != 0 313 | value <- p.adjust(PValue[keep], method="BH") 314 | setNames(value, rownames(tested)[keep]) 315 | }) 316 | genes <- padj < 0.05 317 | table(genes) 318 | ``` 319 | 320 | ### Gene symbol to pathway 321 | 322 | Under the hood, `r Biocpkg("goseq")` uses Bioconductor annotation 323 | packages (in this case `r Biocannopkg("org.Hs.eg.db")` and `r 324 | Biocannopkg("GO.db")`) to map from gene symbols to GO pathways. 325 | 326 | Explore these packages through the `columns()` and `select()` 327 | functions. Can you map from ENSEMBL gene identifiers (the row names 328 | of `topTags()`) to GO pathways? What about 'drilling down' on 329 | particular GO identifiers to discover the term's definition? 330 | 331 | ### Probability weighting function 332 | 333 | Calculate the weighting for each gene.
This looks up the gene lengths 334 | in a pre-defined table (how could these be calculated using TxDb 335 | packages? What challenges are associated with calculating these 336 | 'weights', based on the knowledge that genes typically consist of 337 | several transcripts, each expressed differently?) 338 | 339 | ```{r prostate-edgeR-pwf} 340 | pwf <- nullp(genes,"hg19","ensGene") 341 | head(pwf) 342 | ``` 343 | 344 | ### Over- and under-representation 345 | 346 | Perform the main analysis. This includes association of genes to GO 347 | pathway 348 | 349 | ```{r prostate-goseq-wall} 350 | GO.wall <- goseq(pwf, "hg19", "ensGene") 351 | head(GO.wall) 352 | ``` 353 | 354 | ### What if we'd ignored gene length? 355 | 356 | Here we do the same operation, but ignore gene lengths 357 | 358 | ```{r prostate-goseq-nobias} 359 | GO.nobias <- goseq(pwf,"hg19","ensGene",method="Hypergeometric") 360 | ``` 361 | 362 | Compare the over-represented P-values for each set, under the 363 | different methods 364 | 365 | ```{r prostate-goseq-compare, fig.width=5, fig.height=5} 366 | idx <- match(GO.nobias$category, GO.wall$category) 367 | plot(log10(GO.nobias[, "over_represented_pvalue"]) ~ 368 | log10(GO.wall[idx, "over_represented_pvalue"]), 369 | xlab="Wallenius", ylab="Hypergeometric", 370 | xlim=c(-5, 0), ylim=c(-5, 0)) 371 | abline(0, 1, col="red", lwd=2) 372 | ``` 373 | 374 | # References 375 | 376 | - Khatri et al., 2012, PLoS Comp Biol 8.2: e1002375. 377 | - Subramanian et al., 2005, PNAS 102.43: 15545-15550. 378 | - Jiang \& Gentleman, 2007, Bioinformatics Feb 1;23(3):306-13. 379 | - Goeman \& B\"uhlmann, 2007, Bioinformatics 23.8: 980-987. 380 | - Hummel et al., 2008, Bioinformatics 24.1: 78-85. 381 | - Wu \& Smyth 2012, Nucleic Acids Research 40, e133. 382 | - Wu et al., 2010 Bioinformatics 26, 2176-2182. 383 | - Majewski et al., 2010, Blood, published online 5 May 2010. 384 | - Tarca et al., 2009, Bioinformatics 25.1: 75-82. 385 | - Young et al., 2010, Genome Biology 11:R14. 
386 | -------------------------------------------------------------------------------- /vignettes/F_ChIPSeq.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: F. ChIP-seq 3 | author: Martin Morgan (mtmorgan@fredhutch.org) 4 | date: "`r Sys.Date()`" 5 | output: 6 | BiocStyle::html_document: 7 | markdown_strict: true 8 | toc: true 9 | vignette: > 10 | %\VignetteIndexEntry{F. ChIP-seq} 11 | %\VignetteEngine{knitr::rmarkdown} 12 | \usepackage[utf8]{inputenc} 13 | --- 14 | 15 | ```{r style, echo=FALSE, results='asis'} 16 | BiocStyle::markdown() 17 | ``` 18 | 19 | ```{r setup, echo=FALSE} 20 | options(digits=3) 21 | suppressPackageStartupMessages({ 22 | library(csaw) 23 | library(edgeR) 24 | library(GenomicRanges) 25 | library(ChIPseeker) 26 | library(genefilter) 27 | }) 28 | ``` 29 | 30 | # Motivation & work flow 31 | 32 | Key references 33 | 34 | - Kharchenko, Tolstorukov, and Park 35 | ([2008](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2597701/)). 36 | - Lun and Smyth ([2014](https://doi.org/10.1093/nar/gku351)). 37 | 38 | ## ChIP-seq 39 | 40 | Kharchenko et 41 | al. ([2008](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2597701/)). 
42 | ![ChIP-seq Overview](our_figures/ChIPSeq_nbt-1508-F1.jpg) 43 | 44 | - Tags versus sequenced reads; single-end read extension in 3' 45 | direction 46 | - Strand shift / cross-correlation 47 | - Defined (narrow, e.g., transcription factor binding sites) versus 48 | diffuse (e.g., histone marks) peaks 49 | 50 | ChIP-seq for differential binding 51 | 52 | - Designed experiment with replicate samples per treatment 53 | - Analysis using insights from microarrays / RNA-seq 54 | 55 | Novel statistical issues 56 | 57 | - Inferring peaks without 'data snooping' (using the same data twice, 58 | once to infer peaks, once to estimate differential binding) 59 | - Retaining power 60 | - Minimizing false discovery rate 61 | 62 | ## Work flow 63 | 64 | - Following Bailey et al., 65 | [2013](https://doi.org/10.1371/journal.pcbi.1003326) 66 | 67 | ![](our_figures/ChIPSeq-workflow.png) 68 | 69 | Experimental design and execution 70 | 71 | - Single sample 72 | 73 | - ChIPed transcription factor and ... 74 | - Input (fragmented genomic DNA) or control (e.g., IP with 75 | non-specific antibody such as immunoglobulin G, IgG) 76 | 77 | - Designed experiments 78 | 79 | - Replication of TF / control pairs 80 | 81 | Sequencing & alignment 82 | 83 | - Sequencing depth rules of thumb: $>10M$ reads for narrow peaks, 84 | $>20M$ for broad peaks 85 | - Long & paired end useful but not essential -- alignment in ambiguous 86 | regions 87 | - Basic aligners generally adequate, e.g., no need to align splice 88 | junctions 89 | - Sims et al., [2014](https://doi.org/10.1038/nrg3642) 90 | 91 | Peak calling 92 | 93 | - Very large number of peak calling programs; some specialized for 94 | e.g., narrow vs. broad peaks. 95 | - Commonly used: [MACS](http://liulab.dfci.harvard.edu/MACS/), 96 | PeakSeq, CisGenome, ...
97 | - MACS: Model-based Analysis of ChIP-Seq, Zhang et al., 98 | [2008](https://doi.org/10.1186/gb-2008-9-9-r137) 99 | 100 | - Scale control tag counts to match ChIP counts 101 | - Center peaks by shifting $d/2$ 102 | - Model occurrence of a tag as a Poisson process 103 | - Look for fixed width sliding windows with an excess number of tags 104 | (enrichment) 105 | - Empirical FDR: Swap ChIP and control samples; FDR is \# control 106 | peaks / \# ChIP peaks 107 | - Output: BED file of called peaks 108 | 109 | Down-stream analysis 110 | 111 | - Annotation: what genes are my peaks near? 112 | - Differential representation: which peaks are over- or 113 | under-represented in treatment 1, compared to treatment 2? 114 | - Motif identification (peaks over known motifs?) and discovery 115 | - Integrative analysis, e.g., association of regulatory elements and 116 | expression 117 | 118 | ## Peak calling 119 | 120 | 'Known' ranges 121 | 122 | - Count tags in pre-defined ranges, e.g., promoter regions of known 123 | genes 124 | - Obvious limitations, e.g., regulatory elements not in specified 125 | ranges; specified range contains multiple regulatory elements with 126 | complementary behavior 127 | 128 | _de novo_ windows 129 | 130 | - Width: narrow peaks, 1bp; broad peaks, 150bp 131 | - Offset: 25-100bp; influencing computational burden 132 | 133 | _de novo_ peak calling 134 | 135 | - Third-party software (many available; 136 | [MACS](http://liulab.dfci.harvard.edu/MACS/) commonly used) 137 | - Various strategies for calling peaks -- Lun & Smyth, 138 | [Table 1](http://nar.oxfordjournals.org/content/42/11/e95/T1.expansion.html) 139 | 140 | - Call each sample independently; intersection or union of peaks 141 | across samples, ... 142 | - Call peaks from a pooled library 143 | - ...
144 | 145 | - Relevant slides [pdf](http://bioconductor.org/help/course-materials/2014/CSAMA2014/4_Thursday/lectures/ChIPSeq_slides.pdf) 146 | 147 | ## Peak calling across libraries 148 | 149 | - Table 1: Description of peak calling strategies. Each 150 | strategy is given an identifier and is described by the mode in 151 | which MACS is run, the libraries on which it is run and the 152 | consolidation operation (if any) performed to combine peaks between 153 | libraries or groups. For method 6, the union of the peaks in each 154 | direction of enrichment is taken.
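| ID | Mode          | Library           | Operation    |
|----|---------------|-------------------|--------------|
| 1  | Single-sample | Individual        | Union        |
| 2  | Single-sample | Individual        | Intersection |
| 3  | Single-sample | Individual        | At least 2   |
| 4  | Single-sample | Pooled over group | Union        |
| 5  | Single-sample | Pooled over group | Intersection |
| 6  | Two-sample    | Pooled over group | Union        |
| 7  | Single-sample | Pooled over all   |              |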
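
The consolidation operations in Table 1 (union, intersection, 'at least 2') can be illustrated with `r Biocpkg("GenomicRanges")` set operations. This is only a minimal sketch under stated assumptions: the three peak sets `lib1`, `lib2`, `lib3` are made-up toy ranges, not MACS output, and this is not necessarily how Lun & Smyth implemented their comparison.

```r
library(GenomicRanges)

## Hypothetical peak calls from three libraries (toy coordinates)
lib1 <- GRanges("chr1", IRanges(c(100, 500), c(200, 600)))
lib2 <- GRanges("chr1", IRanges(c(150, 900), c(250, 1000)))
lib3 <- GRanges("chr1", IRanges(c(180, 550), c(300, 650)))
libs <- list(lib1, lib2, lib3)

## Union: regions called as a peak in any library
peaks_union <- Reduce(union, libs)

## Intersection: regions called as a peak in every library
peaks_intersect <- Reduce(intersect, libs)

## At least 2: union of all pairwise intersections
pairwise <- list(intersect(lib1, lib2),
                 intersect(lib1, lib3),
                 intersect(lib2, lib3))
peaks_atleast2 <- Reduce(union, pairwise)

peaks_union
peaks_intersect
peaks_atleast2
```

The union is the most permissive consolidation and the intersection the most stringent, which is one way to read the differing error rates reported for strategies 1 to 3.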
210 | 211 | - How to choose? -- Lun & Smyth, 212 | 213 | - Under the null hypothesis, $P$ values are uniformly distributed, so the observed type I error rate should match the specified threshold 214 | - [Table 2](http://nar.oxfordjournals.org/content/42/11/e95/T2.expansion.html): 215 | consequences for type I error 216 | - Best strategy: call peaks from a pooled library (strategy 7) 217 | - Table 2: The observed type I error rate when testing 218 | for differential enrichment using counts from each peak calling 219 | strategy. Error rates for a range of specified error thresholds 220 | are shown. All values represent the mean of 10 simulation 221 | iterations with the standard error shown in brackets. RA: 222 | reference analysis using 10 000 randomly chosen true peaks.
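| ID | Error rate at 0.01 | Error rate at 0.05 | Error rate at 0.1 |
|----|--------------------|--------------------|-------------------|
| RA | 0.010 (0.000)      | 0.051 (0.001)      | 0.100 (0.002)     |
| 1  | 0.002 (0.000)      | 0.019 (0.001)      | 0.053 (0.001)     |
| 2  | 0.003 (0.000)      | 0.030 (0.000)      | 0.073 (0.001)     |
| 3  | 0.006 (0.000)      | 0.042 (0.001)      | 0.092 (0.001)     |
| 4  | 0.033 (0.001)      | 0.145 (0.001)      | 0.261 (0.002)     |
| 5  | 0.000 (0.000)      | 0.001 (0.000)      | 0.005 (0.000)     |
| 6  | 0.088 (0.006)      | 0.528 (0.013)      | 0.893 (0.006)     |
| 7  | 0.010 (0.000)      | 0.049 (0.001)      | 0.098 (0.001)     |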
288 | 289 | 290 | ```{r null-p, cache=TRUE} 291 | ## 100,000 t-tests under the null, n = 6 292 | n <- 6; m <- matrix(rnorm(n * 100000), ncol=n) 293 | P <- genefilter::rowttests(m, factor(rep(1:2, each=3)))$p.value 294 | quantile(P, c(.001, .01, .05)) 295 | hist(P, breaks=20) 296 | ``` 297 | 298 | _de novo_ hybrid strategies 299 | 300 | # Practical: Differential binding (`r Biocpkg("csaw")`) 301 | 302 | This exercise is based on the `r Biocpkg("csaw")` vignette, where more 303 | detail can be found. 304 | 305 | ## 1 - 4: Experimental Design ... Alignment 306 | 307 | The experiment involves changes in binding profiles of the NFYA 308 | protein between embryonic stem cells and terminal neurons. It is a 309 | subset of the data provided by Tiwari et 310 | al. [2012](https://doi.org/10.1038/ng.1036) available as 311 | [GSE25532](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE25532). There 312 | are two es (embryonic stem cell) and two tn (terminal neuron) 313 | replicates. Single-end FASTQ files were extracted from GEO, aligned 314 | using `r Biocpkg("Rsubread")`, and post-processed (sorted and indexed) 315 | using `r Biocpkg("Rsamtools")` with the script available at 316 | 317 | ```{r csaw-preprocess, eval=FALSE} 318 | system.file(package="UseBioconductor", "script", "ChIPSeq", "NFYA", 319 | "preprocess_NFYA.R") 320 | ``` 321 | 322 | Create a data frame summarizing the files used.
323 | 324 | ```{r csaw-setup} 325 | files <- local({ 326 | acc <- c(es_1="SRR074398", es_2="SRR074399", tn_1="SRR074417", 327 | tn_2="SRR074418") 328 | data.frame(Treatment=sub("_.*", "", names(acc)), 329 | Replicate=sub(".*_", "", names(acc)), 330 | sra=sprintf("%s.sra", acc), 331 | fastq=sprintf("%s.fastq.gz", acc), 332 | bam=sprintf("%s.fastq.gz.subread.BAM", acc), 333 | row.names=acc, stringsAsFactors=FALSE) 334 | }) 335 | ``` 336 | 337 | ## 5: Reduction 338 | 339 | Change to the directory where the BAM files are located 340 | 341 | ```{r csaw-setwd, eval=FALSE} 342 | setwd("~/UseBioconductor-data/ChIPSeq/NFYA") 343 | ``` 344 | 345 | Load the csaw library and count reads in overlapping windows. This 346 | returns a `SummarizedExperiment`, so explore it a bit... 347 | 348 | ```{r csaw-reduction, eval=FALSE} 349 | library(csaw) 350 | library(GenomicRanges) 351 | frag.len <- 110 352 | system.time({ 353 | data <- windowCounts(files$bam, width=10, ext=frag.len) 354 | }) # 156 seconds 355 | acc <- sub(".fastq.*", "", data$bam.files) 356 | colData(data) <- cbind(files[acc,], colData(data)) 357 | ``` 358 | 359 | ## 6: Analysis 360 | 361 | **Filtering** (vignette Chapter 3) Start by filtering low-count 362 | windows. There are likely to be many of these (how many?). Is there a 363 | rational way to choose the filtering threshold? 364 | 365 | ```{r csaw-filter, eval=FALSE} 366 | library(edgeR) # for aveLogCPM() 367 | keep <- aveLogCPM(assay(data)) >= -1 368 | data <- data[keep,] 369 | ``` 370 | 371 | ```{r csaw-data-load, echo=FALSE} 372 | frag.len <- 110 373 | fl <- system.file(package="UseBioconductor", "extdata", "csaw-data-filtered.Rds") 374 | data <- readRDS(fl) 375 | ``` 376 | 377 | **Normalization (composition bias)** (vignette Chapter 4) csaw uses 378 | binned counts in normalization. The bins are large relative to the 379 | ChIP peaks, on the assumption that the bins primarily represent 380 | non-differentially bound regions. 
The sample bin counts are normalized 381 | using the `r Biocpkg("edgeR")` TMM (trimmed mean of M-values) method 382 | seen in the RNA-seq differential expression lab. Explore vignette 383 | chapter 4 for more on normalization (this is a useful resource when 384 | seeking to develop normalization methods for other protocols!). 385 | 386 | ```{r csaw-normalize, eval=FALSE} 387 | system.time({ 388 | binned <- windowCounts(files$bam, bin=TRUE, width=10000) 389 | }) # 139 seconds 390 | normfacs <- normalize(binned) 391 | ``` 392 | ```{r csaw-normacs-load, echo=FALSE} 393 | fl <- system.file(package="UseBioconductor", "extdata", "csaw-normfacs.Rds") 394 | normfacs <- readRDS(fl) 395 | ``` 396 | 397 | **Experimental design and Differential binding** (vignette Chapter 5) 398 | Differential binding will be assessed using `r Biocpkg("edgeR")`, 399 | where we need to specify the experimental design. 400 | 401 | ```{r csaw-experimental-design} 402 | design <- model.matrix(~Treatment, colData(data)) 403 | ``` 404 | 405 | Apply a standard `r Biocpkg("edgeR")` work flow to identify 406 | differentially bound regions. Creatively explore the results. 407 | 408 | ```{r csaw-de} 409 | y <- asDGEList(data, norm.factors=normfacs) 410 | y <- estimateDisp(y, design) 411 | fit <- glmQLFit(y, design, robust=TRUE) 412 | results <- glmQLFTest(fit, contrast=c(0, 1)) 413 | head(results$table) 414 | ``` 415 | 416 | **Multiple testing** (vignette Chapter 6) The challenge is that FDR 417 | across all detected differentially bound _regions_ is what one is 418 | interested in, but what is immediately available is the FDR across 419 | differentially bound _windows_; a region will often consist of multiple 420 | overlapping windows.
As a first step, we'll take a 'quick and dirty' 421 | approach to identifying regions by merging 'high-abundance' windows 422 | that are within, e.g., 1kb of one another. 423 | 424 | ```{r csaw-merge-windows} 425 | merged <- mergeWindows(rowRanges(data), tol=1000L) 426 | ``` 427 | 428 | Combine test results across windows within regions. Several strategies 429 | are explored in section 6.5 of the vignette. 430 | 431 | ```{r csaw-combine-merged-tests} 432 | tabcom <- combineTests(merged$id, results$table) 433 | head(tabcom) 434 | ``` 435 | 436 | Section 6.6 of the vignette discusses approaches to identifying the 437 | 'best' windows within regions. 438 | 439 | Finally, create a `GRangesList` that associates the two result tables 440 | with the genomic ranges over which the results were calculated. 441 | 442 | ```{r csaw-grangeslist} 443 | gr <- rowRanges(data) 444 | mcols(gr) <- as(results$table, "DataFrame") 445 | grl <- split(gr, merged$id) 446 | mcols(grl) <- as(tabcom, "DataFrame") 447 | ``` 448 | 449 | ## Annotation 450 | 451 | ### csaw 452 | 453 | ### ChIPseeker 454 | -------------------------------------------------------------------------------- /vignettes/GSEA/2012-07-06-Gentleman-GSEA.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/2012-07-06-Gentleman-GSEA.pdf -------------------------------------------------------------------------------- /vignettes/GSEA/Category-ribosome.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/Category-ribosome.png -------------------------------------------------------------------------------- /vignettes/GSEA/GSEA2011.pdf: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/GSEA2011.pdf -------------------------------------------------------------------------------- /vignettes/GSEA/GSEAlm.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/GSEAlm.pdf -------------------------------------------------------------------------------- /vignettes/GSEA/L8_Gene_Set_Testing.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/L8_Gene_Set_Testing.pdf -------------------------------------------------------------------------------- /vignettes/GSEA/SPIA.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/SPIA.pdf -------------------------------------------------------------------------------- /vignettes/GSEA/SPIA.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/SPIA.png -------------------------------------------------------------------------------- /vignettes/GSEA/goseq.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/goseq.pdf -------------------------------------------------------------------------------- /vignettes/GSEA/subramanian-F1-part.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/subramanian-F1-part.jpg -------------------------------------------------------------------------------- /vignettes/GSEA/subramanian-F1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/subramanian-F1.jpg -------------------------------------------------------------------------------- /vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2-cropped.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2-cropped.pdf -------------------------------------------------------------------------------- /vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2-cropped.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2-cropped.png -------------------------------------------------------------------------------- /vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2-cropped.tiff: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2-cropped.tiff -------------------------------------------------------------------------------- /vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/GSEA/young-et-al-gb-2010-11-2-r14-2.pdf -------------------------------------------------------------------------------- /vignettes/I_LargeData.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: I. Working with Large Data 3 | author: Martin Morgan (mtmorgan@fredhutch.org) 4 | date: "`r Sys.Date()`" 5 | output: 6 | BiocStyle::html_document: 7 | toc: true 8 | vignette: > 9 | %\VignetteIndexEntry{I. Working with Large Data} 10 | %\VignetteEngine{knitr::rmarkdown} 11 | \usepackage[utf8]{inputenc} 12 | --- 13 | 14 | ```{r style, echo=FALSE, results='asis'} 15 | BiocStyle::markdown() 16 | suppressPackageStartupMessages({ 17 | library(rtracklayer) 18 | library(BiocParallel) 19 | library(GenomicFiles) 20 | library(TxDb.Hsapiens.UCSC.hg19.knownGene) 21 | }) 22 | ``` 23 | 24 | # Scalable computing 25 | 26 | Efficient _R_ code 27 | 28 | - Vectorize! 29 | - Reuse others' work -- `r Biocpkg("DESeq2")`, 30 | `r Biocpkg("GenomicRanges")`, `r Biocpkg("Biostrings")`, ..., 31 | `r CRANpkg("dplyr")`, `r CRANpkg("data.table")`, `r CRANpkg("Rcpp")` 32 | - Useful tools: `system.time()`, `Rprof()`, `r CRANpkg("microbenchmark")` 33 | - More detail in 34 | [deadly sins](http://bioconductor.org/help/course-materials/2014/CSAMA2014/1_Monday/labs/IntermediateR.html#efficient-code) 35 | of a previous course. 36 | 37 | Iteration 38 | 39 | - Chunk-wise 40 | - `open()`, read chunk(s), `close()`.
41 | - e.g., `yieldSize` argument to `Rsamtools::BamFile()` 42 | 43 | Restriction 44 | 45 | - Limit to columns and / or rows of interest 46 | - Exploit domain-specific formats, e.g., BAM files and 47 | `Rsamtools::ScanBamParam()` 48 | - Use a data base 49 | 50 | Sampling 51 | 52 | - Iterate through large data, retaining a manageable sample, e.g., 53 | `ShortRead::FastqSampler()` 54 | 55 | Parallel evaluation 56 | 57 | - **After** writing efficient code 58 | - Typically, `lapply()`-like operations 59 | - Cores on a single machine ('easy'); clusters (more tedious); 60 | clouds 61 | 62 | # File management 63 | 64 | ## File classes 65 | 66 | | Type | Example use | Name | Package | 67 | |-------|-----------------------|-----------------------------|----------------------------------| 68 | | .bed | Range annotations | `BedFile()` | `r Biocpkg("rtracklayer")` | 69 | | .wig | Coverage | `WigFile()`, `BigWigFile()` | `r Biocpkg("rtracklayer")` | 70 | | .gtf | Transcript models | `GTFFile()` | `r Biocpkg("rtracklayer")` | 71 | | | | `makeTxDbFromGFF()` | `r Biocpkg("GenomicFeatures")` | 72 | | .2bit | Genomic Sequence | `TwoBitFile()` | `r Biocpkg("rtracklayer")` | 73 | | .fastq | Reads & qualities | `FastqFile()` | `r Biocpkg("ShortRead")` | 74 | | .bam | Aligned reads | `BamFile()` | `r Biocpkg("Rsamtools")` | 75 | | .tbx | Indexed tab-delimited | `TabixFile()` | `r Biocpkg("Rsamtools")` | 76 | | .vcf | Variant calls | `VcfFile()` | `r Biocpkg("VariantAnnotation")` | 77 | 78 | ```{r rtracklayer-file-classes} 79 | ## rtracklayer menagerie 80 | library(rtracklayer) 81 | names(getClass("RTLFile")@subclasses) 82 | ``` 83 | 84 | Notes 85 | 86 | - Not a consistent interface 87 | - `open()`, `close()`, `import()` / `yield()` / `read*()` 88 | - Some: selective import via index (e.g., `.bai`, bam index); 89 | selection ('columns'); restriction ('rows') 90 | 91 | ## Managing a collection of files 92 | 93 | `*FileList()` classes 94 | 95 | - `reduceByYield()` -- iterate through a 
single large file 96 | - `bplapply()` (`r Biocpkg("BiocParallel")`) -- perform independent 97 | operations on several files, in parallel 98 | 99 | `GenomicFiles()` 100 | 101 | - 'rows' as genomic range restrictions, 'columns' as files 102 | - Each row x column is a _map_ from file data to useful representation 103 | in _R_ 104 | - `reduceByRange()`, `reduceByFile()`: collapse maps into summary 105 | representation 106 | - see the GenomicFiles vignette 107 | [Figure 1](http://bioconductor.org/packages/devel/bioc/vignettes/GenomicFiles/inst/doc/GenomicFiles.pdf) 108 | 109 | # Parallel evaluation with BiocParallel 110 | 111 | Standardized interface for simple parallel evaluation. 112 | 113 | - `bplapply()` instead of `lapply()` 114 | - Argument `BPPARAM` influences how parallel evaluation occurs 115 | 116 | - `MulticoreParam()`: forked processes on a single (non-Windows) machine 117 | - `SnowParam()`: processes on the same or different machines 118 | - `BatchJobsParam()`: resource scheduler on a cluster 119 | 120 | Other resources 121 | 122 | - [Bioconductor Amazon AMI](http://bioconductor.org/help/bioconductor-cloud-ami/) 123 | 124 | - Easily 'spin up' tens of instances 125 | - Pre-configured with Bioconductor packages and StarCluster 126 | management 127 | 128 | - `r Biocpkg("GoogleGenomics")` to interact with Google compute cloud 129 | and resources 130 | 131 | 132 | # Practical 133 | 134 | ### Efficient code 135 | 136 | Define the following function. 137 | 138 | ```{r benchmark-f0} 139 | f0 <- function(n) { 140 | ## inefficient! 141 | ans <- numeric() 142 | for (i in seq_len(n)) 143 | ans <- c(ans, exp(i)) 144 | ans 145 | } 146 | ``` 147 | 148 | Use `system.time()` to explore how long this takes to execute as `n` 149 | increases from 100 to 10000. Use `identical()` and 150 | `r CRANpkg("microbenchmark")` to compare the correctness and 151 | performance of three alternatives, `f1()`, `f2()`, and 152 | `f3()`.
What strategies are these functions using? 153 | 154 | ```{r benchmark} 155 | f1 <- function(n) { 156 | ans <- numeric(n) 157 | for (i in seq_len(n)) 158 | ans[[i]] <- exp(i) 159 | ans 160 | } 161 | 162 | f2 <- function(n) 163 | sapply(seq_len(n), exp) 164 | 165 | f3 <- function(n) 166 | exp(seq_len(n)) 167 | ``` 168 | 169 | ### Sleeping serially and in parallel 170 | 171 | Go to sleep for 1 second, then return `i`. This takes 8 seconds. 172 | 173 | ```{r parallel-sleep} 174 | library(BiocParallel) 175 | 176 | fun <- function(i) { 177 | Sys.sleep(1) 178 | i 179 | } 180 | 181 | ## serial 182 | f0 <- function(n) 183 | lapply(seq_len(n), fun) 184 | 185 | ## parallel 186 | f1 <- function(n) 187 | bplapply(seq_len(n), fun) 188 | ``` 189 | 190 | ## Reads overlapping windows 191 | 192 | This exercise uses the following packages: 193 | 194 | ```{r csaw-packages} 195 | library(GenomicAlignments) 196 | library(GenomicFiles) 197 | library(BiocParallel) 198 | library(Rsamtools) 199 | library(GenomeInfoDb) 200 | ``` 201 | 202 | This is a re-implementation of the basic `r Biocpkg("csaw")` binned 203 | counts algorithm. It supposes that ChIP fragment lengths are 110 nt, 204 | and that we bin coverage in windows of width 50. We focus on chr1. 205 | 206 | ```{r olaps-chr} 207 | frag.len <- 110 208 | spacing <- 50 209 | chr <- "chr1" 210 | ``` 211 | 212 | Here we point to the bam files, indicating that we'll process the 213 | files in chunks of size 1,000,000. 
214 | 215 | ```{r olaps-tileGenome} 216 | fls <- dir("~/UseBioconductor-data/ChIPSeq/NFYA/", ".BAM$", full.names=TRUE) 217 | names(fls) <- sub(".fastq.*", "", basename(fls)) 218 | bfl <- BamFileList(fls, yieldSize=1000000) 219 | ``` 220 | 221 | We'll create the counting bins using `tileGenome()`, focusing on the 222 | 'standard' chromosomes. 223 | 224 | ```{r csaw-tiles} 225 | len <- seqlengths(keepStandardChromosomes(seqinfo(bfl)))[chr] 226 | tiles <- tileGenome(len, tilewidth=spacing, cut.last.tile.in.chrom=TRUE) 227 | ``` 228 | 229 | We'll use `reduceByYield()` to iterate through a single file. We need 230 | to tell this function how we'll `YIELD` a chunk of the file, how we'll 231 | `MAP` the chunk from its input representation to the per-window 232 | counts, and finally how we'll `REDUCE` successive chunks into a final 233 | representation. 234 | 235 | `YIELD` must be a function that takes one argument, the 236 | input source, and returns a chunk of records. 237 | 238 | ```{r yield} 239 | yield <- function(x, ...) 240 | readGAlignments(x) 241 | ``` 242 | 243 | `MAP` must take the output of `YIELD` and perhaps additional arguments, 244 | and return a vector of counts. We'll resize the genomic ranges 245 | describing the alignments so that they have a width equal to the 246 | fragment length. 247 | 248 | ```{r map} 249 | map <- function(x, tiles, frag.len, ...) { 250 | gr <- keepStandardChromosomes(granges(x)) 251 | countOverlaps(tiles, resize(gr, frag.len)) 252 | } 253 | ``` 254 | 255 | `REDUCE` takes two results from `MAP` (in our case, vectors of counts) 256 | and combines them into a single result. We simply add our vectors (`+` 257 | is actually a function!). 258 | 259 | ```{r reduce} 260 | reduce <- `+` 261 | ``` 262 | 263 | To process one file, we use `reduceByYield()`, passing the file we 264 | want to process, the yield, map, and reduce functions. 
Our 'wrapper' 265 | function passes any additional arguments through to `reduceByYield()` 266 | using `...`: 267 | 268 | ```{r reduceByYield} 269 | count1file <- function(bf, ...) 270 | reduceByYield(bf, yield, map, reduce, ...) 271 | ``` 272 | 273 | Using `yieldSize` and `reduceByYield()` means that we do not consume 274 | too much memory processing each file, so that we can process files in 275 | parallel using `r Biocpkg("BiocParallel")`. The `simplify2array()` 276 | function transforms a list-of-vectors to a matrix. 277 | 278 | ```{r count-overlaps-parallel, eval=FALSE} 279 | counts <- bplapply(bfl, count1file, tiles=tiles, frag.len=frag.len) 280 | counts <- simplify2array(counts) 281 | dim(counts) 282 | colSums(counts) 283 | ``` 284 | 285 | # Resources 286 | 287 | - Lawrence, M, and Morgan, M. 2014. Scalable Genomics with R and 288 | Bioconductor. Statistical Science 2014, Vol. 29, No. 2, 289 | 214-226. http://arxiv.org/abs/1409.2864v1 290 | 291 | [BiocParallel]: http://bioconductor.org/packages/release/bioc/html/BiocParallel.html 292 | [GenomicFiles]: http://bioconductor.org/packages/release/bioc/html/GenomicFiles.html 293 | -------------------------------------------------------------------------------- /vignettes/our_figures/ChIPSeq-workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/ChIPSeq-workflow.png -------------------------------------------------------------------------------- /vignettes/our_figures/ChIPSeq_nbt-1508-F1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/ChIPSeq_nbt-1508-F1.jpg -------------------------------------------------------------------------------- /vignettes/our_figures/FilesToPackages.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/FilesToPackages.png -------------------------------------------------------------------------------- /vignettes/our_figures/GRanges.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/GRanges.png -------------------------------------------------------------------------------- /vignettes/our_figures/GRangesImplementation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/GRangesImplementation.png -------------------------------------------------------------------------------- /vignettes/our_figures/GRangesList.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/GRangesList.png -------------------------------------------------------------------------------- /vignettes/our_figures/GRangesListImplementation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/GRangesListImplementation.png -------------------------------------------------------------------------------- /vignettes/our_figures/RangeOperations.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/RangeOperations.png 
-------------------------------------------------------------------------------- /vignettes/our_figures/SequencingEcosystem.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/SequencingEcosystem.png -------------------------------------------------------------------------------- /vignettes/our_figures/SummarizedExperiment.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/SummarizedExperiment.png -------------------------------------------------------------------------------- /vignettes/our_figures/batch_effects_nrg2825-f2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/batch_effects_nrg2825-f2.jpg -------------------------------------------------------------------------------- /vignettes/our_figures/copy_number_QC_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/copy_number_QC_2.png -------------------------------------------------------------------------------- /vignettes/our_figures/journal.pcbi.1003118.t001.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/journal.pcbi.1003118.t001.png -------------------------------------------------------------------------------- /vignettes/our_figures/nmeth.3252-F2.jpg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/nmeth.3252-F2.jpg -------------------------------------------------------------------------------- /vignettes/our_figures/nrg2825-f2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/UseBioconductor/03a4349243c9f10609d5ec563391dd12cc149a70/vignettes/our_figures/nrg2825-f2.jpg --------------------------------------------------------------------------------