├── .gitignore ├── README.md ├── chp02.md ├── chp03.md ├── chp04.md ├── chp05.md ├── chp06.md ├── chp07.md ├── chp08.md ├── chp09.md ├── chp10.md ├── chp11.md └── compgenomr_exercises.Rproj /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rproj 3 | .Rhistory 4 | .RData 5 | .Ruserdata 6 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Solution manual for "Computational Genomics with R" book 2 | This repository contains the exercises, and solutions to those exercises, for the 3 | ["Computational Genomics with R" book](https://www.routledge.com/Computational-Genomics-with-R/Akalin/p/book/9781498781855) published by CRC Press. 4 | 5 | Check the individual ".md" files for each chapter for solutions. -------------------------------------------------------------------------------- /chp02.md: -------------------------------------------------------------------------------- 1 | ## Exercises and solutions for Chapter 2 2 | 3 | ### Computations in R 4 | 5 | 1. Sum 2 and 3 using the `+` operator. [Difficulty: **Beginner**] 6 | 7 | **solution:** 8 | ```{r,eval=FALSE,echo=FALSE} 9 | 2+3 10 | ``` 11 | 12 | 2. Take the square root of 36, use `sqrt()`. [Difficulty: **Beginner**] 13 | 14 | **solution:** 15 | ```{r,echo=FALSE,eval=FALSE} 16 | sqrt(36) 17 | ``` 18 | 19 | 3. Take the log10 of 1000, use the function `log10()`. [Difficulty: **Beginner**] 20 | 21 | **solution:** 22 | ```{r,echo=FALSE,eval=FALSE} 23 | log10(1000) 24 | ``` 25 | 26 | 4. Take the log2 of 32, use the function `log2()`. [Difficulty: **Beginner**] 27 | 28 | **solution:** 29 | ```{r,echo=FALSE,eval=FALSE} 30 | log2(32) 31 | ``` 32 | 33 | 5. Assign the sum of 2, 3 and 4 to variable `x`. [Difficulty: **Beginner**] 34 | 35 | **solution:** 36 | ```{r,echo=FALSE,eval=FALSE} 37 | x = 2+3+4 38 | x <- 2+3+4 39 | ``` 40 | 41 | 6. 
Find the absolute value of the expression `5 - 145` using the `abs()` function. [Difficulty: **Beginner**] 42 | 43 | **solution:** 44 | ```{r,echo=FALSE,eval=FALSE} 45 | abs(5-145) 46 | ``` 47 | 48 | 7. Calculate the square root of 625, divide it by 5, and assign it to variable `x`. Ex: `y= log10(1000)/5`, the previous statement takes the log10 of 1000, divides it by 5, and assigns the value to variable `y`. [Difficulty: **Beginner**] 49 | 50 | **solution:** 51 | ```{r,echo=FALSE,eval=FALSE} 52 | x = sqrt(625)/5 53 | 54 | ``` 55 | 56 | 8. Multiply the value you get from the previous exercise by 10000, and assign it to variable `x`. 57 | Ex: `y=y*5` multiplies `y` by 5 and assigns the value to `y`. 58 | **KEY CONCEPT:** results of computations or arbitrary values can be stored in variables; we can re-use those variables later on and over-write them with new values. 59 | [Difficulty: **Beginner**] 60 | 61 | **solution:** 62 | ```{r,echo=FALSE,eval=FALSE} 63 | x = x*10000 # over-writes x with the new value 64 | ``` 65 | 66 | ### Data structures in R 67 | 68 | 69 | 10. Make a vector of 1,2,3,5 and 10 using `c()`, and assign it to the `vec` variable. Ex: `vec1=c(1,3,4)` makes a vector out of 1,3,4. [Difficulty: **Beginner**] 70 | 71 | **solution:** 72 | ```{r,echo=FALSE,eval=FALSE} 73 | vec=c(1,2,3,5,10) 74 | vec1=c(1,2,3,4,5,10) # a 6-element variant, used by later subsetting examples 75 | ``` 76 | 77 | 11. Check the length of your vector with `length()`. 78 | Ex: `length(vec1)` should return 3. [Difficulty: **Beginner**] 79 | 80 | **solution:** 81 | ```{r,echo=FALSE,eval=FALSE} 82 | length(vec) 83 | ``` 84 | 85 | 12. Make a vector of all numbers between 2 and 15. 86 | Ex: `vec=1:6` makes a vector of numbers between 1 and 6, and assigns it to the `vec` variable. [Difficulty: **Beginner**] 87 | 88 | **solution:** 89 | ```{r,echo=FALSE,eval=FALSE} 90 | vec=2:15 91 | 92 | ``` 93 | 94 | 13. Make a vector of 4s repeated 10 times using the `rep()` function. Ex: `rep(x=2,times=5)` makes a vector of 2s repeated 5 times. 
[Difficulty: **Beginner**] 95 | 96 | **solution:** 97 | ```{r,echo=FALSE,eval=FALSE} 98 | rep(x=4,times=10) 99 | rep(4,10) 100 | ``` 101 | 102 | 14. Make a logical vector with TRUE, FALSE values of length 4, use `c()`. 103 | Ex: `c(TRUE,FALSE)`. [Difficulty: **Beginner**] 104 | 105 | **solution:** 106 | ```{r,echo=FALSE,eval=FALSE} 107 | c(TRUE,FALSE,FALSE,TRUE) 108 | c(TRUE,TRUE,FALSE,TRUE) 109 | 110 | ``` 111 | 112 | 15. Make a character vector of the gene names PAX6, ZIC2, OCT4 and SOX2. 113 | Ex: `avec=c("a","b","c")` makes a character vector of a, b and c. [Difficulty: **Beginner**] 114 | 115 | **solution:** 116 | ```{r,echo=FALSE,eval=FALSE} 117 | c("PAX6","ZIC2","OCT4","SOX2") 118 | 119 | ``` 120 | 121 | 16. Subset the vector using `[]` notation, and get the 5th and 6th elements. 122 | Ex: `vec1[1]` gets the first element. `vec1[c(1,3)]` gets the 1st and 3rd elements. [Difficulty: **Beginner**] 123 | 124 | **solution:** 125 | ```{r,echo=FALSE,eval=FALSE} 126 | vec1[c(5,6)] 127 | 128 | ``` 129 | 130 | 17. You can also subset any vector using a logical vector in `[]`. Run the following: 131 | 132 | ```{r subsetLogicExercise, eval=FALSE} 133 | myvec=1:5 134 | # the length of the logical vector 135 | # should be equal to length(myvec) 136 | myvec[c(TRUE,TRUE,FALSE,FALSE,FALSE)] 137 | myvec[c(TRUE,FALSE,FALSE,FALSE,TRUE)] 138 | ``` 139 | [Difficulty: **Beginner**] 140 | 141 | 142 | **solution:** 143 | ```{r,echo=FALSE,eval=FALSE} 144 | myvec=1:5 145 | myvec[c(TRUE,TRUE,FALSE,FALSE,FALSE)] # the length of the logical vector should be equal to length(myvec) 146 | myvec[c(TRUE,FALSE,FALSE,FALSE,TRUE)] 147 | ``` 148 | 149 | 18. The `==`, `>`, `<`, `>=`, `<=` operators create logical vectors. See the results of the following operations: 150 | 151 | ```{r,eval=FALSE} 152 | myvec > 3 153 | myvec == 4 154 | myvec <= 2 155 | myvec != 4 156 | ``` 157 | [Difficulty: **Beginner**] 158 | 159 | 19. 
Use the `>` operator in `myvec[ ]` to get elements larger than 2 in `myvec`, which is described above. [Difficulty: **Beginner**] 160 | 161 | 162 | **solution:** 163 | ```{r,echo=FALSE,eval=FALSE} 164 | myvec[ myvec > 2 ] 165 | 166 | ``` 167 | 168 | 169 | 20. Make a 5x3 matrix (5 rows, 3 columns) using `matrix()`. 170 | Ex: `matrix(1:6,nrow=3,ncol=2)` makes a 3x2 matrix using numbers between 1 and 6. [Difficulty: **Beginner**] 171 | 172 | **solution:** 173 | ```{r,echo=FALSE,eval=FALSE} 174 | mat=matrix(1:15,nrow=5,ncol=3) 175 | 176 | ``` 177 | 178 | 21. What happens when you use `byrow = TRUE` in your `matrix()` call as an additional argument? 179 | Ex: `mat=matrix(1:6,nrow=3,ncol=2,byrow = TRUE)`. [Difficulty: **Beginner**] 180 | 181 | **solution:** 182 | ```{r,echo=FALSE,eval=FALSE} 183 | mat=matrix(1:15,nrow=5,ncol=3,byrow = TRUE) 184 | 185 | ``` 186 | 187 | 22. Extract the first 3 columns and first 3 rows of your matrix using `[]` notation. [Difficulty: **Beginner**] 188 | 189 | 190 | **solution:** 191 | ```{r,echo=FALSE,eval=FALSE} 192 | mat[1:3,1:3] 193 | 194 | ``` 195 | 196 | 23. Extract the last two rows of the matrix you created earlier. 197 | Ex: `mat[2:3,]` or `mat[c(2,3),]` extracts the 2nd and 3rd rows. 198 | [Difficulty: **Beginner**] 199 | 200 | **solution:** 201 | ```{r,echo=FALSE,eval=FALSE} 202 | mat[4:5,] 203 | mat[c(nrow(mat)-1,nrow(mat)),] 204 | # tail(mat,n=1) would return only the last row 205 | tail(mat,n=2) 206 | ``` 207 | 208 | 24. Extract the first two columns and run `class()` on the result. 209 | [Difficulty: **Beginner**] 210 | 211 | **solution:** 212 | ```{r,echo=FALSE,eval=FALSE} 213 | class(mat[,1:2]) 214 | 215 | ``` 216 | 217 | 25. Extract the first column and run `class()` on the result, compare with the above exercise. 218 | [Difficulty: **Beginner**] 219 | 220 | **solution:** 221 | ```{r,echo=FALSE,eval=FALSE} 222 | class(mat[,1]) 223 | 224 | ``` 225 | 226 | 26. Make a data frame with 3 columns and 5 rows. 
Make sure the first column is a sequence 227 | of numbers 1:5, and the second column is a character vector. 228 | Ex: `df=data.frame(col1=1:3,col2=c("a","b","c"),col3=3:1) # 3x3 data frame`. 229 | Remember you need to make a 5x3 data frame (5 rows, 3 columns). [Difficulty: **Beginner**] 230 | 231 | **solution:** 232 | ```{r,echo=FALSE,eval=FALSE} 233 | df=data.frame(col1=1:5,col2=c("a","2","3","b","c"),col3=5:1) 234 | 235 | ``` 236 | 237 | 27. Extract the first two columns and first two rows. 238 | **HINT:** Use the same notation as matrices. [Difficulty: **Beginner**] 239 | 240 | **solution:** 241 | ```{r,echo=FALSE,eval=FALSE} 242 | df[,1:2] 243 | 244 | df[1:2,] 245 | df[1:2,1:2] 246 | ``` 247 | 248 | 28. Extract the last two rows of the data frame you made. 249 | **HINT:** Same notation as matrices. [Difficulty: **Beginner**] 250 | 251 | **solution:** 252 | ```{r,echo=FALSE,eval=FALSE} 253 | df[4:5,] 254 | 255 | ``` 256 | 257 | 29. Extract the last two columns using the column names of the data frame you made. [Difficulty: **Beginner**] 258 | 259 | 260 | **solution:** 261 | ```{r,echo=FALSE,eval=FALSE} 262 | df[,c("col2","col3")] 263 | 264 | ``` 265 | 266 | 30. Extract the second column using the column names. 267 | You can use `[]` or `$` as in lists; use both in two different answers. [Difficulty: **Beginner**] 268 | 269 | **solution:** 270 | ```{r,echo=FALSE,eval=FALSE} 271 | df$col2 272 | df[,"col2"] 273 | class(df["col2"]) # a one-column data frame 274 | class(df[,"col2"]) # a vector 275 | ``` 276 | 277 | 31. Extract rows where the 1st column is larger than 3. 278 | **HINT:** You can get a logical vector using the `>` operator, 279 | and logical vectors can be used in `[]` when subsetting. [Difficulty: **Beginner**] 280 | 281 | **solution:** 282 | ```{r,echo=FALSE,eval=FALSE} 283 | df[df$col1 > 3, ] 284 | 285 | ``` 286 | 287 | 32. Extract rows where the 1st column is larger than or equal to 3. 
288 | [Difficulty: **Beginner**] 289 | 290 | **solution:** 291 | ```{r,echo=FALSE,eval=FALSE} 292 | df[df$col1 >= 3 , ] 293 | 294 | ``` 295 | 296 | 33. Convert a data frame to a matrix. **HINT:** Use `as.matrix()`. 297 | Observe what happens to numeric values in the data frame. [Difficulty: **Beginner**] 298 | 299 | **solution:** 300 | ```{r,echo=FALSE,eval=FALSE} 301 | class(df[,c(1,3)]) 302 | as.matrix(df[,c(1,3)]) 303 | as.matrix(df) 304 | ``` 305 | 306 | 307 | 34. Make a list using the `list()` function. Your list should have 4 elements; 308 | the one below has 2. Ex: `mylist= list(a=c(1,2,3),b=c("apple","orange"))` 309 | [Difficulty: **Beginner**] 310 | 311 | **solution:** 312 | ```{r,echo=FALSE,eval=FALSE} 313 | mylist= list(a=c(1,2,3), 314 | b=c("apple","orange"), 315 | c=matrix(1:4,nrow=2), 316 | d="hello") 317 | ``` 318 | 319 | 35. Select the 1st element of the list you made using `$` notation. 320 | Ex: `mylist$a` selects the first element named "a". 321 | [Difficulty: **Beginner**] 322 | 323 | **solution:** 324 | ```{r,echo=FALSE,eval=FALSE} 325 | mylist$a 326 | 327 | ``` 328 | 329 | 36. Select the 4th element of the list you made earlier using `$` notation. [Difficulty: **Beginner**] 330 | 331 | **solution:** 332 | ```{r,echo=FALSE,eval=FALSE} 333 | mylist$d 334 | 335 | ``` 336 | 337 | 338 | 37. Select the 1st element of your list using `[ ]` notation. 339 | Ex: `mylist[1]` selects the first element by position and `mylist["a"]` selects it by name; either way you get a list with one element. 340 | [Difficulty: **Beginner**] 341 | 342 | **solution:** 343 | ```{r,echo=FALSE,eval=FALSE} 344 | mylist[1] # -> still a list 345 | mylist[[1]] # -> the element itself, not a list 346 | 347 | mylist["a"] 348 | mylist[["a"]] 349 | 350 | ``` 351 | 38. Select the 4th element of your list using `[ ]` notation. 
[Difficulty: **Beginner**] 352 | 353 | **solution:** 354 | ```{r,echo=FALSE,eval=FALSE} 355 | mylist[4] 356 | mylist[[4]] 357 | ``` 358 | 359 | 39. Make a factor using `factor()`, with 5 elements. 360 | Ex: `fa=factor(c("a","a","b"))`. [Difficulty: **Beginner**] 361 | 362 | **solution:** 363 | ```{r,echo=FALSE,eval=FALSE} 364 | fa=factor(c("a","a","b","c","c")) 365 | 366 | ``` 367 | 368 | 40. Convert a character vector to a factor using `as.factor()`. 369 | First, make a character vector using `c()`, then use `as.factor()`. 370 | [Difficulty: **Intermediate**] 371 | 372 | **solution:** 373 | ```{r,echo=FALSE,eval=FALSE} 374 | my.vec=c("a","a","b","c","c") 375 | fa=as.factor(my.vec) 376 | fa 377 | 378 | ``` 379 | 380 | 41. Convert the factor you made above to a character using `as.character()`. [Difficulty: **Beginner**] 381 | 382 | **solution:** 383 | ```{r,echo=FALSE,eval=FALSE} 384 | fa 385 | as.character(fa) 386 | 387 | ``` 388 | 389 | 390 | 391 | ### Reading in and writing data out in R 392 | 393 | 1. Read CpG island (CpGi) data from the compGenomRData package file `CpGi.table.hg18.txt`. This is a tab-separated file. Store it in a variable called `cpgi`. Use 394 | ``` 395 | cpgFilePath=system.file("extdata", 396 | "CpGi.table.hg18.txt", 397 | package="compGenomRData") 398 | ``` 399 | to get the file path within the installed `compGenomRData` package. [Difficulty: **Beginner**] 400 | 401 | **solution:** 402 | ```{r,echo=FALSE,eval=FALSE} 403 | cpgFilePath 404 | cpgi=read.table(file=cpgFilePath,header=TRUE,sep="\t") 405 | 406 | ``` 407 | 408 | 2. Use `head()` on `cpgi` to see the first few rows. [Difficulty: **Beginner**] 409 | 410 | **solution:** 411 | ```{r,echo=FALSE,eval=FALSE} 412 | head(cpgi) 413 | 414 | ``` 415 | 416 | 3. Why doesn't the following work? See the `sep` argument at `help(read.table)`. 
[Difficulty: **Beginner**] 417 | 418 | ```{r readCpGex, eval=FALSE} 419 | cpgtFilePath=system.file("extdata", 420 | "CpGi.table.hg18.txt", 421 | package="compGenomRData") 422 | cpgtFilePath 423 | cpgiSepComma=read.table(cpgtFilePath,header=TRUE,sep=",") 424 | head(cpgiSepComma) 425 | ``` 426 | 427 | **solution:** 428 | ```{r,echo=FALSE,eval=FALSE} 429 | cpgiSepComma=read.table(cpgtFilePath,header=TRUE,sep=",") 430 | head(cpgiSepComma) # the file is tab-separated, so with sep="," every line is read as one single column 431 | ``` 432 | 433 | 4. What happens when you set `stringsAsFactors=FALSE` in `read.table()`? [Difficulty: **Beginner**] 434 | ```{r readCpGex12, eval=FALSE} 435 | cpgiHF=read.table("intro2R_data/data/CpGi.table.hg18.txt", 436 | header=FALSE,sep="\t", 437 | stringsAsFactors=FALSE) 438 | ``` 439 | 440 | **solution:** 441 | The character column is now read as character instead of factor. 442 | ```{r,echo=FALSE,eval=FALSE} 443 | head(cpgiHF) 444 | head(cpgi) 445 | class(cpgiHF$V2) # character 446 | # with stringsAsFactors=TRUE this column would be a factor 447 | 448 | ``` 449 | 450 | 451 | 5. Read only the first 10 rows of the CpGi table. [Difficulty: **Beginner/Intermediate**] 452 | 453 | **solution:** 454 | ```{r,echo=FALSE,eval=FALSE} 455 | cpgi10row=read.table(cpgFilePath,header=TRUE,sep="\t",nrows=10) 456 | cpgi10row 457 | ``` 458 | 459 | 6. Use `cpgFilePath=system.file("extdata","CpGi.table.hg18.txt",` 460 | `package="compGenomRData")` to get the file path, then use 461 | `read.table()` with argument `header=FALSE`. Use `head()` to see the results. [Difficulty: **Beginner**] 462 | 463 | 464 | **solution:** 465 | ```{r,echo=FALSE,eval=FALSE} 466 | df=read.table(cpgFilePath,header=FALSE,sep="\t") 467 | head(df) 468 | ``` 469 | 470 | 7. Write CpG islands to a text file called "my.cpgi.file.txt". Write the file 471 | to your home folder; you can use `file="~/my.cpgi.file.txt"` on Linux. 
`~/` denotes the 472 | home folder. [Difficulty: **Beginner**] 473 | 474 | **solution:** 475 | ```{r,echo=FALSE,eval=FALSE} 476 | write.table(cpgi,file="~/my.cpgi.file.txt") 477 | 478 | ``` 479 | 480 | 481 | 8. Same as above, but this time make sure to use the `quote=FALSE`, `sep="\t"` and `row.names=FALSE` arguments. Save the file to "my.cpgi.file2.txt" and compare it with "my.cpgi.file.txt". [Difficulty: **Beginner**] 482 | 483 | **solution:** 484 | ```{r,echo=FALSE,eval=FALSE} 485 | write.table(cpgi,file="~/my.cpgi.file2.txt",quote=FALSE,sep="\t",row.names=FALSE) 486 | 487 | ``` 488 | 489 | 490 | 9. Write out the first 10 rows of the `cpgi` data frame. 491 | **HINT:** Use subsetting for data frames we learned before. [Difficulty: **Beginner**] 492 | 493 | **solution:** 494 | ```{r,echo=FALSE,eval=FALSE} 495 | write.table(cpgi[1:10,],file="~/my.cpgi.fileNrow10.txt",quote=FALSE,sep="\t") 496 | 497 | ``` 498 | 499 | 500 | 501 | 10. Write the first 3 columns of the `cpgi` data frame. [Difficulty: **Beginner**] 502 | 503 | **solution:** 504 | ```{r,echo=FALSE,eval=FALSE} 505 | dfnew=cpgi[,1:3] 506 | write.table(dfnew,file="~/my.cpgi.fileCol3.txt",quote=FALSE,sep="\t") 507 | ``` 508 | 509 | 11. Write CpG islands only on chr1. **HINT:** Use subsetting with `[]`, and feed it a logical vector made with the `==` operator. [Difficulty: **Beginner/Intermediate**] 510 | 511 | **solution:** 512 | ```{r,echo=FALSE,eval=FALSE} 513 | write.table(cpgi[cpgi$chrom == "chr1",],file="~/my.cpgi.fileChr1.txt", 514 | quote=FALSE,sep="\t",row.names=FALSE) 515 | head(cpgi[cpgi$chrom == "chr1",]) 516 | 517 | 518 | ``` 519 | 520 | 521 | 12. Read two other data sets, "rn4.refseq.bed" and "rn4.refseq2name.txt", with `header=FALSE`, and assign them to `df1` and `df2` respectively. 522 | They are again included in the compGenomRData package, and you 523 | can use the `system.file()` function to get the file paths. 
[Difficulty: **Beginner**] 524 | 525 | **solution:** 526 | ```{r,echo=FALSE,eval=FALSE} 527 | df1=read.table(system.file("extdata","rn4.refseq.bed",package="compGenomRData"),sep="\t",header=FALSE) 528 | df2=read.table(system.file("extdata","rn4.refseq2name.txt",package="compGenomRData"),sep="\t",header=FALSE) 529 | ``` 530 | 531 | 532 | 13. Use `head()` to see what is inside the data frames above. [Difficulty: **Beginner**] 533 | 534 | **solution:** 535 | ```{r,echo=FALSE,eval=FALSE} 536 | head(df1) 537 | head(df2) 538 | ``` 539 | 540 | 14. Merge the data sets using `merge()`, assign the result to a variable named `new.df`, and use `head()` to see the results. [Difficulty: **Intermediate**] 541 | 542 | **solution:** 543 | ```{r,echo=FALSE,eval=FALSE} 544 | new.df=merge(df1,df2,by.x="V4",by.y="V1") 545 | head(new.df) 546 | 547 | ``` 548 | 549 | 550 | 551 | ### Plotting in R 552 | 553 | 554 | Please run the following code snippet for the rest of the exercises. 555 | ```{r plotExSeed} 556 | set.seed(1001) 557 | x1=1:100+rnorm(100,mean=0,sd=15) 558 | y1=1:100 559 | ``` 560 | 561 | 1. Make a scatter plot using the `x1` and `y1` vectors generated above. [Difficulty: **Beginner**] 562 | 563 | **solution:** 564 | ```{r,echo=FALSE,eval=FALSE} 565 | plot(x1,y1) 566 | 567 | ``` 568 | 569 | 570 | 2. Use the `main` argument to give a title to `plot()` as in `plot(x,y,main="title")`. [Difficulty: **Beginner**] 571 | 572 | **solution:** 573 | ```{r,echo=FALSE,eval=FALSE} 574 | plot(x1,y1,main="scatter plot") 575 | 576 | ``` 577 | 578 | 579 | 3. Use the `xlab` argument to set a label for the x-axis. Use the `ylab` argument to set a label for the y-axis. [Difficulty: **Beginner**] 580 | 581 | **solution:** 582 | ```{r,echo=FALSE,eval=FALSE} 583 | plot(x1,y1,main="scatter plot",xlab="x label",ylab="y label") 584 | 585 | ``` 586 | 587 | 4. Once you have the plot, run the following expression in the R console and see what `mtext(side=3,text="hi there")` does. **HINT:** `mtext` stands for margin text. 
[Difficulty: **Beginner**] 588 | 589 | **solution:** 590 | ```{r,echo=FALSE,eval=FALSE} 591 | plot(x1,y1,main="scatter plot",xlab="x label",ylab="y label") 592 | mtext(side=3,text="hi there") 593 | 594 | ``` 595 | 596 | 5. See what `mtext(side=2,text="hi there")` does. Check your plot after execution. [Difficulty: **Beginner**] 597 | 598 | **solution:** 599 | ```{r,echo=FALSE,eval=FALSE} 600 | mtext(side=2,text="hi there") 601 | mtext(side=4,text="hi there") 602 | 603 | ``` 604 | 605 | 6. Use *mtext()* and *paste()* to put a margin text on the plot. You can use `paste()` as the 'text' argument in `mtext()`. **HINT:** `mtext(side=3,text=paste(...))`. See how `paste()` is used below. [Difficulty: **Beginner/Intermediate**] 606 | 607 | ```{r pasteExample} 608 | paste("Text","here") 609 | myText=paste("Text","here") 610 | myText 611 | ``` 612 | 613 | **solution:** 614 | ```{r,echo=FALSE,eval=FALSE} 615 | mtext(side=3,text=paste("here","here")) 616 | 617 | 618 | ``` 619 | 620 | 621 | 7. `cor()` calculates the correlation between two vectors. 622 | Pearson correlation is a measure of the linear correlation (dependence) 623 | between two variables X and Y. Try using the `cor()` function on the `x1` and `y1` variables. [Difficulty: **Intermediate**] 624 | 625 | **solution:** 626 | ```{r,echo=FALSE,eval=FALSE} 627 | corxy=cor(x1,y1) # calculates Pearson correlation 628 | 629 | ``` 630 | 631 | 8. Try to use `mtext()`, `cor()` and `paste()` to display the correlation coefficient on your scatter plot. [Difficulty: **Intermediate**] 632 | 633 | **solution:** 634 | ```{r,echo=FALSE,eval=FALSE} 635 | plot(x1,y1,main="scatter") 636 | corxy=cor(x1,y1) 637 | #mtext(side=3,text=paste("Pearson Corr.",corxy)) 638 | mtext(side=3,text=paste("Pearson Corr.",round(corxy,3) ) ) 639 | 640 | plot(x1,y1) 641 | mtext(side=3,text=paste("Pearson Corr.",round( cor(x1,y1) ,3) ) ) 642 | ``` 643 | 644 | 9. Change the colors of your plot using the `col` argument. 645 | Ex: `plot(x,y,col="red")`. 
[Difficulty: **Beginner**] 646 | 647 | **solution:** 648 | ```{r,echo=FALSE,eval=FALSE} 649 | plot(x1,y1,col="red") 650 | 651 | ``` 652 | 653 | 10. Use `pch=19` as an argument in your `plot()` command. [Difficulty: **Beginner**] 654 | 655 | **solution:** 656 | ```{r,echo=FALSE,eval=FALSE} 657 | plot(x1,y1,col="red",pch=19) 658 | 659 | ``` 660 | 661 | 662 | 11. Use `pch=18` as an argument to your `plot()` command. [Difficulty: **Beginner**] 663 | 664 | **solution:** 665 | ```{r,echo=FALSE,eval=FALSE} 666 | plot(x1,y1,col="red",pch=18) 667 | ?points 668 | ``` 669 | 670 | 12. Make a histogram of `x1` with the `hist()` function. A histogram is a graphical representation of the data distribution. [Difficulty: **Beginner**] 671 | 672 | **solution:** 673 | ```{r,echo=FALSE,eval=FALSE} 674 | hist(x1) 675 | 676 | ``` 677 | 678 | 679 | 13. You can change colors with 'col', add labels with 'xlab', 'ylab', and add a title with the 'main' argument. Try all these in a histogram. 680 | [Difficulty: **Beginner**] 681 | 682 | **solution:** 683 | ```{r,echo=FALSE,eval=FALSE} 684 | hist(x1,col="cornflowerblue",xlab="x values",ylab="frequency",main="histogram of x1") 685 | 686 | ``` 687 | 688 | 14. Make a boxplot of `y1` with `boxplot()`. [Difficulty: **Beginner**] 689 | 690 | **solution:** 691 | ```{r,echo=FALSE,eval=FALSE} 692 | boxplot(y1,main="title") 693 | 694 | ``` 695 | 696 | 15. Make boxplots of the `x1` and `y1` vectors in the same plot. [Difficulty: **Beginner**] 697 | 698 | **solution:** 699 | ```{r,echo=FALSE,eval=FALSE} 700 | boxplot(x1,y1,ylab="values",main="title") 701 | 702 | ``` 703 | 704 | 16. In the boxplot, use the `horizontal = TRUE` argument. [Difficulty: **Beginner**] 705 | 706 | **solution:** 707 | ```{r,echo=FALSE,eval=FALSE} 708 | boxplot(x1,y1,ylab="values",main="title",horizontal=TRUE) 709 | 710 | ``` 711 | 712 | 17. 
Make multiple plots with `par(mfrow=c(2,1))`: 713 | - run `par(mfrow=c(2,1))` 714 | - make a boxplot 715 | - make a histogram 716 | [Difficulty: **Beginner/Intermediate**] 717 | 718 | **solution:** 719 | ```{r,echo=FALSE,eval=FALSE} 720 | par( mfrow=c(2,1) ) 721 | hist(x1) 722 | boxplot(y1) 723 | ``` 724 | 725 | 18. Do the same as above but this time with `par(mfrow=c(1,2))`. [Difficulty: **Beginner/Intermediate**] 726 | 727 | **solution:** 728 | ```{r,echo=FALSE,eval=FALSE} 729 | par(mfrow=c(1,2)) 730 | hist(x1) 731 | boxplot(y1) 732 | ``` 733 | 734 | 19. Save your plot using the "Export" button in RStudio. [Difficulty: **Beginner**] 735 | 736 | **solution:** 737 | find and press the Export button 738 | 739 | 740 | 20. You can make a scatter plot showing the density 741 | of points rather than the points themselves. If you use points it looks like this: 742 | 743 | ```{r colorScatterEx,out.width='50%'} 744 | 745 | x2=1:1000+rnorm(1000,mean=0,sd=200) 746 | y2=1:1000 747 | plot(x2,y2,pch=19,col="blue") 748 | ``` 749 | 750 | If you use the `smoothScatter()` function, you get the densities. 751 | ```{r smoothScatterEx,out.width='50%'} 752 | smoothScatter(x2,y2, 753 | colramp=colorRampPalette(c("white","blue", 754 | "green","yellow","red"))) 755 | ``` 756 | 757 | Now, plot with the `colramp=heat.colors` argument and then use a custom color scale using the following argument. 758 | ``` 759 | colramp = colorRampPalette(c("white","blue", "green","yellow","red")) 760 | ``` 761 | [Difficulty: **Beginner/Intermediate**] 762 | 763 | **solution:** 764 | ```{r,echo=FALSE,eval=FALSE} 765 | smoothScatter(x2,y2,colramp = heat.colors ) 766 | smoothScatter(x2,y2, 767 | colramp = colorRampPalette(c("white","blue", "green","yellow","red"))) 768 | 769 | ``` 770 | 771 | ### Functions and control structures (for, if/else, etc.) 772 | Read CpG island data as shown below for the rest of the exercises. 
773 | 774 | ```{r CpGexReadchp2,eval=TRUE} 775 | cpgtFilePath=system.file("extdata", 776 | "CpGi.table.hg18.txt", 777 | package="compGenomRData") 778 | cpgi=read.table(cpgtFilePath,header=TRUE,sep="\t") 779 | head(cpgi) 780 | ``` 781 | 782 | 1. Check the values in the perGc column using a histogram. 783 | The 'perGc' column in the data stands for GC percent, i.e. the percentage of G and C nucleotides. [Difficulty: **Beginner**] 784 | 785 | **solution:** 786 | ```{r,echo=FALSE,eval=FALSE} 787 | hist(cpgi$perGc) # most values are between 60 and 70 788 | 789 | ``` 790 | 791 | 2. Make a boxplot for the 'perGc' column. [Difficulty: **Beginner**] 792 | 793 | **solution:** 794 | ```{r,echo=FALSE,eval=FALSE} 795 | boxplot(cpgi$perGc) 796 | 797 | ``` 798 | 799 | 800 | 3. Use an if/else structure to decide whether a given GC percent is low, high, or medium: 801 | low < 60, high > 75, and medium between 60 and 75; 802 | use the greater-than or less-than operators, `<` or `>`. Fill in the values in the code below, where it is written 'YOU_FILL_IN'. 
[Difficulty: **Intermediate**] 803 | ```{r functionEvExchp2,echo=TRUE,eval=FALSE} 804 | 805 | GCper=65 806 | 807 | # check if GC value is lower than 60, 808 | # assign "low" to result 809 | if('YOU_FILL_IN'){ 810 | result="low" 811 | cat("low") 812 | } else if('YOU_FILL_IN'){ 813 | # check if GC value is higher than 75, 814 | # assign "high" to result 815 | result="high" 816 | cat("high") 817 | }else{ # if those two conditions fail then it must be "medium" 818 | result="medium" 819 | } 820 | 821 | result 822 | 823 | ``` 824 | 825 | **solution:** 826 | ```{r,echo=FALSE,eval=FALSE} 827 | GCper=65 828 | #result="low" # set initial value 829 | 830 | if(GCper < 60){ # check if GC value is lower than 60, assign "low" to result 831 | result="low" 832 | cat("low") 833 | } else if(GCper > 75){ 834 | # check if GC value is higher than 75, assign "high" to result 835 | result="high" 836 | cat("high") 837 | }else{ # if those two conditions fail then it must be "medium" 838 | result="medium" 839 | } 840 | 841 | result 842 | 843 | 844 | ``` 845 | 846 | 4. Write a function that takes a value of GC percent and decides 847 | if it is low, high, or medium: low < 60, high > 75, medium between 60 and 75. 848 | Fill in the values in the code below, where it is written 'YOU_FILL_IN'. 
[Difficulty: **Intermediate/Advanced**] 849 | 850 | 851 | ``` 852 | GCclass<-function(my.gc){ 853 | 854 | YOU_FILL_IN 855 | 856 | return(result) 857 | } 858 | GCclass(10) # should return "low" 859 | GCclass(90) # should return "high" 860 | GCclass(65) # should return "medium" 861 | ``` 862 | 863 | **solution:** 864 | ```{r,echo=FALSE,eval=FALSE} 865 | GCclass<-function(my.gc){ 866 | 867 | result="low" # set initial value 868 | 869 | if(my.gc < 60){ # check if GC value is lower than 60, assign "low" to result 870 | result="low" 871 | } else if(my.gc > 75){ 872 | # check if GC value is higher than 75, assign "high" to result 873 | result="high" 874 | }else{ # if those two conditions fail then it must be "medium" 875 | result="medium" 876 | } 877 | return(result) 878 | } 879 | GCclass(10) # should return "low" 880 | GCclass(90) # should return "high" 881 | GCclass(65) # should return "medium" 882 | ``` 883 | 884 | 885 | 5. Use a for loop to get GC percentage classes for `gcValues` below. Use the function 886 | you wrote above. [Difficulty: **Intermediate/Advanced**] 887 | 888 | ``` 889 | gcValues=c(10,50,70,65,90) 890 | for( i in YOU_FILL_IN){ 891 | YOU_FILL_IN 892 | } 893 | ``` 894 | 895 | **solution:** 896 | ```{r,echo=FALSE,eval=FALSE} 897 | gcValues=c(10,50,70,65,90) 898 | for( i in gcValues){ 899 | 900 | print(GCclass(i) ) 901 | } 902 | ``` 903 | 904 | 905 | 6. Use `lapply()` to get GC percentage classes for `gcValues`. [Difficulty: **Intermediate/Advanced**] 906 | 907 | ```{r lapplyExExerciseChp2,eval=FALSE} 908 | vec=c(1,2,4,5) 909 | power2=function(x){ return(x^2) } 910 | lapply(vec,power2) 911 | ``` 912 | 913 | 914 | **solution:** 915 | ```{r,echo=FALSE,eval=FALSE} 916 | s=lapply(gcValues,GCclass) 917 | 918 | ``` 919 | 920 | 921 | 922 | 7. Use `sapply()` to get GC percentage classes for `gcValues`. [Difficulty: **Intermediate**] 923 | 924 | **solution:** 925 | ```{r,echo=FALSE,eval=FALSE} 926 | s=sapply(gcValues,GCclass) 927 | 928 | ``` 929 | 930 | 8. 
Is there a way to decide on the GC percentage class of a given vector of `GCpercentages` 931 | without using an if/else structure or loops? If so, how can you do it? 932 | **HINT:** Subsetting using the < and > operators. 933 | [Difficulty: **Intermediate**] 934 | 935 | **solution:** 936 | ```{r,echo=FALSE,eval=FALSE} 937 | result=rep("low",length(gcValues) ) 938 | result[gcValues > 75]="high" 939 | result[gcValues >= 60 & gcValues <= 75] = "medium" 940 | ``` 941 | -------------------------------------------------------------------------------- /chp03.md: -------------------------------------------------------------------------------- 1 | ## Exercises and solutions for Chapter 3 2 | 3 | ### How to summarize collection of data points: The idea behind statistical distributions 4 | 5 | 1. Calculate the means and variances 6 | of the rows of the following simulated data set, and plot the distributions 7 | of means and variances using the `hist()` and `boxplot()` functions. [Difficulty: **Beginner/Intermediate**] 8 | ```{r getDataChp3Ex,eval=FALSE} 9 | set.seed(100) 10 | 11 | #sample data matrix from normal distribution 12 | gset=rnorm(600,mean=200,sd=70) 13 | data=matrix(gset,ncol=6) 14 | ``` 15 | 16 | **solution:** 17 | ```{r,echo=FALSE,eval=FALSE} 18 | require(matrixStats) 19 | means=rowMeans(data) 20 | vars=rowVars(data) 21 | hist(means) 22 | hist(vars) 23 | boxplot(means) 24 | boxplot(vars) 25 | ``` 26 | 27 | 28 | 2. Using the data generated above, calculate the standard deviation of the 29 | distribution of the means using the `sd()` function. Compare that to the expected 30 | standard error obtained from the central limit theorem keeping in mind the 31 | population parameters were $\sigma=70$ and $n=6$. How does the estimate from the random samples change if we simulate more data with 32 | `data=matrix(rnorm(6000,mean=200,sd=70),ncol=6)`? 
[Difficulty: **Beginner/Intermediate**] 33 | 34 | **solution:** 35 | ```{r,echo=FALSE,eval=FALSE} 36 | samples=sd(means) 37 | 38 | clt.se=70/sqrt(6) # estimate standard error from population parameters 39 | 40 | data1=matrix(rnorm(6000,mean=200,sd=70),ncol=6) 41 | sd(rowMeans(data1)) 42 | 43 | data2=matrix(rnorm(6000,mean=200,sd=70),ncol=10) 44 | sd(rowMeans(data2)) 45 | ``` 46 | If we simulate more sets of points while keeping the sample size at n=6, as in the code above, the estimate will not change much. The standard error will shrink only if we sample more data points per set. 47 | 48 | 3. Simulate 30 random values using the `rpois()` function. Do this 1000 times and calculate the mean of each sample. Plot the sampling distribution of the means 49 | using a histogram. Get the 2.5th and 97.5th percentiles of the 50 | distribution. [Difficulty: **Beginner/Intermediate**] 51 | 4. Use the `t.test()` function to calculate confidence intervals 52 | of the mean on the first random sample `pois1` simulated from the `rpois()` function below. [Difficulty: **Intermediate**] 53 | ```{r exRpoisChp3,eval=FALSE} 54 | #HINT 55 | set.seed(100) 56 | 57 | #sample 30 values from Poisson dist with lambda parameter = 5 58 | pois1=rpois(30,lambda=5) 59 | 60 | ``` 61 | 62 | 63 | **solution:** 64 | ```{r,echo=FALSE,eval=FALSE} 65 | require(mosaic) 66 | quantile((do(1000)*mean(rpois(30,lambda=5)))[,1],probs=c(0.025,0.975)) 67 | 68 | t.test(pois1) 69 | ``` 70 | 71 | 5. Use the bootstrap confidence interval for the mean on `pois1`. [Difficulty: **Intermediate/Advanced**] 72 | 73 | **solution:** 74 | ```{r,echo=FALSE,eval=FALSE} 75 | quantile((do(1000)*mean(sample(pois1,30,replace=T)))[,1],probs=c(0.025,0.975)) 76 | 77 | ``` 78 | 79 | 6. Compare the theoretical confidence interval of the mean from `t.test()` and the bootstrap confidence interval. Are they similar? 
[Difficulty: **Intermediate/Advanced**] 80 | 81 | **solution:** 82 | They are all similar, but not identical. 83 | 84 | 85 | 7. Try to re-create the following figure, which demonstrates the CLT concept. [Difficulty: **Advanced**] 86 | 87 | **solution:** 88 | ```{r,echo=FALSE,eval=FALSE} 89 | 90 | set.seed(101) 91 | require(mosaic) 92 | par(mfcol=c(4,3)) 93 | par(mar=c(5.1-2,4.1-1,4.1,2.1-2)) 94 | d=c(rnorm(1000,mean=10,sd=8)) 95 | hist(d,main="", 96 | col="black",border="white",breaks=20,xlab="",ylab="" 97 | ) 98 | abline(v=mean(d),col="red") 99 | mtext(expression(paste(mu,"=10")),cex=0.6) 100 | mtext("normal",cex=0.8,line=1) 101 | 102 | norm10=rowMeans(do(1000)*rnorm(10,mean=10,sd=8)) 103 | norm30=rowMeans(do(1000)*rnorm(30,mean=10,sd=8)) 104 | norm100=rowMeans(do(1000)*rnorm(100,mean=10,sd=8)) 105 | hist(norm10,xlim=c(0,20),main="",xlab="",ylab="",breaks=20,col="gray", 106 | border="gray") 107 | mtext("n=10",side=2,cex=0.8,line=2) 108 | hist(norm30,xlim=c(0,20),main="",xlab="",ylab="",breaks=20,col="gray", 109 | border="gray") 110 | mtext("n=30",side=2,cex=0.8,line=2) 111 | hist(norm100,xlim=c(0,20),main="",xlab="",ylab="",breaks=20,col="gray", 112 | border="gray") 113 | mtext("n=100",side=2,cex=0.8,line=2) 114 | 115 | d=rexp(1000) 116 | hist(d,main="", 117 | col="black",border="white",breaks=20,xlab="",ylab="" 118 | ) 119 | abline(v=mean(d),col="red") 120 | mtext(expression(paste(mu,"=1")),cex=0.6) 121 | mtext("exponential",cex=0.8,line=1) 122 | mtext("Distributions of different populations",line=2) 123 | 124 | exp10 =rowMeans(do(2000)*rexp(10)) 125 | exp30 =rowMeans(do(2000)*rexp(30)) 126 | exp100=rowMeans(do(2000)*rexp(100)) 127 | hist(exp10,xlim=c(0,2),main="",xlab="",ylab="",breaks=20,col="gray", 128 | border="gray") 129 | mtext("Sampling distribution of sample means",line=2) 130 | hist(exp30,xlim=c(0,2),main="",xlab="",ylab="",breaks=20,col="gray", 131 | border="gray") 132 | hist(exp100,xlim=c(0,2),main="",xlab="",ylab="",breaks=20,col="gray", 133 |
border="gray") 134 | 135 | d=runif(1000) 136 | hist(d,main="", 137 | col="black",border="white",breaks=20,xlab="",ylab="" 138 | ) 139 | abline(v=mean(d),col="red") 140 | mtext(expression(paste(mu,"=0.5")),cex=0.6) 141 | 142 | mtext("uniform",cex=0.8,line=1) 143 | unif10 =rowMeans(do(1000)*runif(10)) 144 | unif30 =rowMeans(do(1000)*runif(30)) 145 | unif100=rowMeans(do(1000)*runif(100)) 146 | hist(unif10,xlim=c(0,1),main="",xlab="",ylab="",breaks=20,col="gray", 147 | border="gray") 148 | hist(unif30,xlim=c(0,1),main="",xlab="",ylab="",breaks=20,col="gray", 149 | border="gray") 150 | hist(unif100,xlim=c(0,1),main="",xlab="",ylab="",breaks=20,col="gray", 151 | border="gray") 152 | ``` 153 | 154 | ### How to test for differences in samples 155 | 1. Test the difference of means of the following simulated genes 156 | using the randomization, `t.test()`, and `wilcox.test()` functions. 157 | Plot the distributions using histograms and boxplots. [Difficulty: **Intermediate/Advanced**] 158 | ```{r exRnorm1chp3,eval=FALSE} 159 | set.seed(101) 160 | gene1=rnorm(30,mean=4,sd=3) 161 | gene2=rnorm(30,mean=3,sd=3) 162 | 163 | ``` 164 | 165 | **solution:** 166 | ```{r,echo=FALSE,eval=FALSE} 167 | 168 | t.test(gene1,gene2) 169 | wilcox.test(gene1,gene2) 170 | 171 | #rand test 172 | org.diff=mean(gene1)-mean(gene2) 173 | gene.df=data.frame(exp=c(gene1,gene2), 174 | group=c( rep("test",30),rep("control",30) ) ) 175 | 176 | 177 | exp.null <- do(1000) * diff(mosaic::mean(exp ~ shuffle(group), data=gene.df)) 178 | 179 | sum(exp.null>org.diff)/1000 180 | 181 | ``` 182 | 183 | 184 | 2. Test the difference of the means of the following simulated genes 185 | using the randomization, `t.test()` and `wilcox.test()` functions. 186 | Plot the distributions using histograms and boxplots.
[Difficulty: **Intermediate/Advanced**] 187 | ```{r exRnorm2chp3,eval=FALSE} 188 | set.seed(100) 189 | gene1=rnorm(30,mean=4,sd=2) 190 | gene2=rnorm(30,mean=2,sd=2) 191 | 192 | ``` 193 | 194 | **solution:** 195 | ```{r,echo=FALSE,eval=FALSE} 196 | t.test(gene1,gene2) 197 | wilcox.test(gene1,gene2) 198 | #rand test 199 | org.diff=mean(gene1)-mean(gene2) 200 | gene.df=data.frame(exp=c(gene1,gene2), 201 | group=c( rep("test",30),rep("control",30) ) ) 202 | 203 | 204 | exp.null <- do(1000) * diff(mosaic::mean(exp ~ shuffle(group), data=gene.df)) 205 | 206 | sum(exp.null>org.diff)/1000 207 | ``` 208 | 209 | 3. We need an extra data set for this exercise. Read the gene expression data set as follows: 210 | `gexpFile=system.file("extdata","geneExpMat.rds",package="compGenomRData") data=readRDS(gexpFile)`. The data has 100 differentially expressed genes. The first 3 columns are the test samples, and the last 3 are the control samples. Do 211 | a t-test for each gene (each row is a gene), and record the p-values. 212 | Then, do a moderated t-test, as shown in section "Moderated t-tests" in this chapter, and record 213 | the p-values. Make a p-value histogram and compare two approaches in terms of the number of significant tests with the $0.05$ threshold. 214 | On the p-values use FDR (BH), Bonferroni and q-value adjustment methods. 215 | Calculate how many adjusted p-values are below 0.05 for each approach. 
216 | [Difficulty: **Intermediate/Advanced**] 217 | 218 | **solution:** 219 | ```{r,echo=FALSE,eval=FALSE} 220 | gexpFile=system.file("extdata","geneExpMat.rds",package="compGenomRData") 221 | data=readRDS(gexpFile) 222 | 223 | group1=1:3 224 | group2=4:6 225 | n1=3 226 | n2=3 227 | dx=rowMeans(data[,group1])-rowMeans(data[,group2]) 228 | 229 | require(matrixStats) 230 | 231 | # get the estimate of pooled variance 232 | stderr = sqrt( (rowVars(data[,group1])*(n1-1) + 233 | rowVars(data[,group2])*(n2-1)) / (n1+n2-2) * ( 1/n1 + 1/n2 )) 234 | 235 | # do the shrinking towards median 236 | mod.stderr = (stderr + median(stderr)) / 2 # moderation in variation 237 | 238 | # estimate t statistic with moderated variance 239 | t.mod <- dx / mod.stderr 240 | 241 | # calculate P-value of rejecting null 242 | p.mod = 2*pt( -abs(t.mod), n1+n2-2 ) 243 | 244 | # estimate t statistic without moderated variance 245 | t = dx / stderr 246 | 247 | # calculate P-value of rejecting null 248 | p = 2*pt( -abs(t), n1+n2-2 ) 249 | 250 | par(mfrow=c(1,2)) 251 | hist(p,col="cornflowerblue",border="white",main="",xlab="P-values t-test") 252 | mtext(paste("significant tests:",sum(p<0.05)) ) 253 | hist(p.mod,col="cornflowerblue",border="white",main="", 254 | xlab="P-values mod. t-test") 255 | mtext(paste("significant tests:",sum(p.mod<0.05)) ) 256 | 257 | ``` 258 | 259 | ### Relationship between variables: Linear models and correlation 260 | 261 | Below we are going to simulate X and Y values that are needed for the 262 | rest of the exercise. 263 | ```{r exLM1chp3,eval=FALSE} 264 | # set random number seed, so that the random numbers from the text 265 | # are the same when you run the code.
266 | set.seed(32) 267 | 268 | # get 50 X values between 1 and 100 269 | x = runif(50,1,100) 270 | 271 | # set b0,b1 and variance (sigma) 272 | b0 = 10 273 | b1 = 2 274 | sigma = 20 275 | # simulate error terms from normal distribution 276 | eps = rnorm(50,0,sigma) 277 | # get y values from the linear equation and addition of error terms 278 | y = b0 + b1*x+ eps 279 | ``` 280 | 281 | 282 | 1. Run the code then fit a line to predict Y based on X. [Difficulty:**Intermediate**] 283 | 284 | **solution:** 285 | ```{r,echo=FALSE,eval=FALSE} 286 | lm(y~x) 287 | 288 | ``` 289 | 290 | 2. Plot the scatter plot and the fitted line. [Difficulty:**Intermediate**] 291 | 292 | **solution:** 293 | ```{r,echo=FALSE,eval=FALSE} 294 | plot(x,y) 295 | abline(lm(y~x)) 296 | 297 | ``` 298 | 299 | 3. Calculate correlation and R^2. [Difficulty:**Intermediate**] 300 | 301 | **solution:** 302 | ```{r,echo=FALSE,eval=FALSE} 303 | cor(x,y) # pearson cor 304 | cor(x,y)^2 # R squared 305 | summary(lm(y~x)) 306 | 307 | ``` 308 | 309 | 4. Run the `summary()` function and 310 | try to extract P-values for the model from the object 311 | returned by `summary`. See `?summary.lm`. [Difficulty:**Intermediate/Advanced**] 312 | 313 | **solution:** 314 | ```{r,echo=FALSE,eval=FALSE} 315 | summary(lm(y~x)) 316 | 317 | ``` 318 | 319 | 5. Plot the residuals vs. the fitted values plot, by calling the `plot()` 320 | function with `which=1` as the second argument. First argument 321 | is the model returned by `lm()`. [Difficulty:**Advanced**] 322 | 323 | **solution:** 324 | ```{r,echo=FALSE,eval=FALSE} 325 | mod=lm(y~x) 326 | plot(mod,which=1) 327 | 328 | ``` 329 | 330 | 331 | 332 | 6. For the next exercises, read the data set histone modification data set. Use the following to get the path to the file: 333 | ``` 334 | hmodFile=system.file("extdata", 335 | "HistoneModeVSgeneExp.rds", 336 | package="compGenomRData") 337 | ``` 338 | There are 3 columns in the dataset. 
These are measured levels of H3K4me3, 339 | H3K27me3 and gene expression per gene. Once you read in the data, plot the scatter plot for H3K4me3 vs. expression. [Difficulty:**Beginner**] 340 | 341 | **solution:** 342 | ```{r,echo=FALSE,eval=FALSE} 343 | hdat=readRDS(hmodFile) 344 | plot(hdat$H3k4me3,hdat$measured_log2) 345 | 346 | ``` 347 | 348 | 349 | 7. Plot the scatter plot for H3K27me3 vs. expression. [Difficulty:**Beginner**] 350 | 351 | **solution:** 352 | ```{r,echo=FALSE,eval=FALSE} 353 | plot(hdat$H3k27me3,hdat$measured_log2) 354 | 355 | ``` 356 | 357 | 8. Fit the model for prediction of expression data using: 1) Only H3K4me3 as explanatory variable, 2) Only H3K27me3 as explanatory variable, and 3) Using both H3K4me3 and H3K27me3 as explanatory variables. Inspect the `summary()` function output in each case: which terms are significant? [Difficulty:**Beginner/Intermediate**] 358 | 359 | **solution:** 360 | ```{r,echo=FALSE,eval=FALSE} 361 | mod1=lm(measured_log2~H3k4me3,data=hdat) 362 | mod2=lm(measured_log2~H3k27me3,data=hdat) 363 | mod3=lm(measured_log2~H3k4me3+H3k27me3,data=hdat) 364 | 365 | summary(mod1) 366 | summary(mod2) 367 | summary(mod3) 368 | 369 | ``` 370 | All terms are significant. 371 | 372 | 373 | 10. Is using H3K4me3 and H3K27me3 better than the model with only H3K4me3? [Difficulty:**Intermediate**] 374 | 375 | **solution:** 376 | Using both histone marks is only slightly better than using H3K4me3 alone, based on R-squared values. 377 | 378 | 11. Plot H3k4me3 vs. H3k27me3. Inspect the points that do not 379 | follow a linear trend. Are they clustered at certain segments 380 | of the plot? Bonus: Is there any biological or technical interpretation 381 | for those points? [Difficulty:**Intermediate/Advanced**] 382 | 383 | **solution:** 384 | ```{r,echo=FALSE,eval=FALSE} 385 | smoothScatter(hdat$H3k27me3,hdat$H3k4me3) 386 | cor(hdat$H3k27me3,hdat$H3k4me3) 387 | ``` 388 | There is a slight negative correlation between these two marks.
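The strength of that negative correlation can also be checked more formally. Below is a minimal sketch, assuming the `hdat` data frame loaded in the solutions above is still in the workspace; `cor.test()` gives a confidence interval and p-value for the Pearson correlation, and a rank-based (Spearman) correlation is less affected by the off-trend points visible in the smoothed scatter plot:

```{r,eval=FALSE}
# test whether the correlation between the two marks differs from zero;
# assumes hdat from the previous solution is available
cor.test(hdat$H3k27me3, hdat$H3k4me3)

# rank-based correlation, more robust to the off-trend points
cor(hdat$H3k27me3, hdat$H3k4me3, method = "spearman")
```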
389 | -------------------------------------------------------------------------------- /chp04.md: -------------------------------------------------------------------------------- 1 | ## Exercises and solutions for Chapter 4 2 | 3 | For this set of exercises we will be using the expression data shown below: 4 | ```{r dataLoadClu,eval=FALSE} 5 | expFile=system.file("extdata", 6 | "leukemiaExpressionSubset.rds", 7 | package="compGenomRData") 8 | mat=readRDS(expFile) 9 | 10 | ``` 11 | 12 | ### Clustering 13 | 14 | 1. We want to observe the effect of data transformation in this exercise. Scale the expression matrix with the `scale()` function. In addition, try taking the logarithm of the data with the `log2()` function prior to scaling. Make box plots of the unscaled and scaled data sets using the `boxplot()` function. [Difficulty: **Beginner/Intermediate**] 15 | 16 | **solution:** 17 | ```{r,echo=FALSE,eval=FALSE} 18 | # transform data matrix 19 | scaled_mat <- scale(mat) # by default, center=TRUE, scale=TRUE 20 | logscaled_mat <- scale(log2(mat)) 21 | # make boxplots 22 | boxplot(mat) 23 | boxplot(scaled_mat) 24 | boxplot(logscaled_mat) 25 | ``` 26 | 27 | 2. For the same problem above using the unscaled data and different data transformation strategies, use the `ward.d` distance in hierarchical clustering and plot multiple heatmaps. You can try to use the `pheatmap` library or any other library that can plot a heatmap with a dendrogram. Which data-scaling strategy provides more homogeneous clusters with respect to disease types? 
[Difficulty: **Beginner/Intermediate**] 28 | 29 | **solution:** 30 | ```{r,echo=FALSE,eval=FALSE} 31 | library(pheatmap) # See https://towardsdatascience.com/pheatmap-draws-pretty-heatmaps-483dab9a3cc for a helpful tutorial 32 | 33 | # The leukemia type for each sample is indicated by the first 3 letters of each ID, which we'll store to use in pheatmap 34 | annotation_col <- data.frame(LeukemiaType =substr(colnames(mat),1,3)) # store first 3 letters from each ID in dataframe 35 | rownames(annotation_col)=colnames(mat) # add rownames to this dataframe to crossreference each ID to its type in next step 36 | 37 | # generate heatmap with unscaled data 38 | pheatmap(mat = mat, 39 | show_rownames = FALSE, show_colnames = FALSE, # names are too long to print 40 | annotation_col = annotation_col, # to show leukemia type for each sample 41 | clustering_method = "ward.D2", # distance metric is euclidean by default 42 | main = "Unscaled heatmap") # adds title 43 | # generate heatmap with scaled data 44 | pheatmap(mat = scaled_mat, show_rownames = FALSE, show_colnames = FALSE, annotation_col = annotation_col, clustering_method = "ward.D2", main = "Scaled heatmap") 45 | # generate heatmap with log-scaled data 46 | pheatmap(mat = logscaled_mat, show_rownames = FALSE, show_colnames = FALSE, annotation_col = annotation_col, clustering_method = "ward.D2", main = "Log-scaled heatmap") 47 | ``` 48 | 49 | *The clusters seem comparable with all three strategies: In all three heatmaps, one case doesn't cluster with its type, but otherwise the remaining cases do cluster as expected.* 50 | 51 | 52 | 3. For the transformed and untransformed data sets used in the exercise above, use the silhouette for deciding number of clusters using hierarchical clustering. [Difficulty: **Intermediate/Advanced**] 53 | 54 | **solution:** 55 | ```{r,echo=FALSE,eval=FALSE} 56 | #coming soon 57 | 58 | ``` 59 | 60 | 4. Now, use the Gap Statistic for deciding the number of clusters in hierarchical clustering. 
Is it the same number of clusters identified by the two methods? Is it similar to the number of clusters obtained using the k-means algorithm in the chapter? [Difficulty: **Intermediate/Advanced**] 61 | 62 | **solution:** 63 | ```{r,echo=FALSE,eval=FALSE} 64 | #coming soon 65 | 66 | ``` 67 | 68 | ### Dimension reduction 69 | We will be using the leukemia expression data set again. You can use it as shown in the clustering exercises. 70 | 71 | 1. Do PCA on the expression matrix using the `princomp()` function and then use the `screeplot()` function to visualize the explained variation by eigenvectors. How many top components explain 95% of the variation? [Difficulty: **Beginner**] 72 | 73 | **solution:** 74 | ```{r,echo=FALSE,eval=FALSE} 75 | PCA_Leuk <- princomp(scaled_mat) 76 | summary(PCA_Leuk) 77 | screeplot(PCA_Leuk) 78 | ``` 79 | 80 | *While the first principal component explains nearly 50% of the variation, the top 25 principal components are required to explain 95% of the variation.* 81 | 82 | 2. Our next tasks are to remove eigenvectors and reconstruct the matrix using SVD, then calculate the reconstruction error as the difference between original and reconstructed matrix. Remove a few eigenvectors, reconstruct the matrix and calculate the reconstruction error. Reconstruction error can be the Euclidean distance between the original and reconstructed matrices. HINT: You have to use the `svd()` function and set the eigenvalue to $0$ for the component you want to remove. [Difficulty: **Intermediate/Advanced**] 83 | 84 | **solution:** 85 | ```{r,echo=FALSE,eval=FALSE} 86 | #coming soon 87 | 88 | ``` 89 | 90 | 3. Produce a 10-component ICA from the expression data set. Remove each component and measure the reconstruction error without that component. Rank the components by decreasing reconstruction error. [Difficulty: **Advanced**] 91 | 92 | **solution:** 93 | ```{r,echo=FALSE,eval=FALSE} 94 | #coming soon 95 | 96 | ``` 97 | 98 | 4.
In this exercise we use the `Rtsne()` function on the leukemia expression data set. Try to increase and decrease the perplexity parameter of t-SNE, and describe the observed changes in the 2D plots. [Difficulty: **Beginner**] 99 | 100 | **solution:** 101 | ```{r,echo=FALSE,eval=FALSE} 102 | library(Rtsne) 103 | set.seed(42) # Set seed for reproducible results 104 | # transpose the matrix so that samples are in rows; with 60 samples, 105 | # Rtsne() requires perplexity < (n-1)/3, so the default of 30 is too large 106 | tsne_out <- Rtsne(t(mat), perplexity = 10) 107 | tsne_2 <- Rtsne(t(mat), perplexity = 2) 108 | tsne_15 <- Rtsne(t(mat), perplexity = 15) 109 | 110 | # Show the samples in the 2D tsne representation 111 | plot(tsne_out$Y,col=as.factor(annotation_col$LeukemiaType), 112 | pch=19) 113 | 114 | # create the legend for the Leukemia types 115 | legend("bottomleft", 116 | legend=unique(annotation_col$LeukemiaType), 117 | fill =palette("default"), 118 | border=NA,box.col=NA) 119 | 120 | # Show the samples with a lower perplexity 121 | plot(tsne_2$Y,col=as.factor(annotation_col$LeukemiaType), 122 | pch=19) 123 | # Show the samples with a higher perplexity 124 | plot(tsne_15$Y,col=as.factor(annotation_col$LeukemiaType), 125 | pch=19) 126 | ``` 127 | 128 | *As perplexity increases, the range of the x- and y-axes is reduced.* 129 | -------------------------------------------------------------------------------- /chp05.md: -------------------------------------------------------------------------------- 1 | ## Exercises and solutions for Chapter 5 2 | 3 | ### Classification 4 | For this set of exercises we will be using the gene expression and patient annotation data from glioblastoma patients.
You can read the data as shown below: 5 | ```{r,readMLdataEx,eval=FALSE} 6 | library(compGenomRData) 7 | # get file paths 8 | fileLGGexp=system.file("extdata", 9 | "LGGrnaseq.rds", 10 | package="compGenomRData") 11 | fileLGGann=system.file("extdata", 12 | "patient2LGGsubtypes.rds", 13 | package="compGenomRData") 14 | # gene expression values 15 | gexp=readRDS(fileLGGexp) 16 | 17 | # patient annotation 18 | patient=readRDS(fileLGGann) 19 | 20 | ``` 21 | 22 | 1. Our first task is to not use any data transformation and do classification. Run the k-NN classifier on the data without any transformation or scaling. What is the effect on classification accuracy for k-NN predicting the CIMP and noCIMP status of the patient? [Difficulty: **Beginner**] 23 | 24 | 2. Bootstrap resampling can be used to measure the variability of the prediction error. Use bootstrap resampling with k-NN for the prediction accuracy. How different is it from cross-validation for different $k$s? [Difficulty: **Intermediate**] 25 | 26 | 3. There are a number of ways to get variable importance for a classification problem. Run random forests on the classification problem above. Compare the variable importance metrics from random forest and the one obtained from DALEX. How many variables are the same in the top 10? [Difficulty: **Advanced**] 27 | 28 | 4. Come up with a unified importance score by normalizing importance scores from random forests and DALEX, followed by taking the average of those scores. [Difficulty: **Advanced**] 29 | 30 | ### Regression 31 | For this set of problems we will use the regression data set where we tried to predict the age of the sample from the methylation values. The data can be loaded as shown below: 32 | ```{r, readMethAgeex,eval=FALSE} 33 | # file path for CpG methylation and age 34 | fileMethAge=system.file("extdata", 35 | "CpGmeth2Age.rds", 36 | package="compGenomRData") 37 | 38 | # read methylation-age table 39 | ameth=readRDS(fileMethAge) 40 | ``` 41 | 42 | 1. 
Run random forest regression and plot the importance metrics. [Difficulty: **Beginner**] 43 | 44 | 2. Split 20% of the methylation-age data as test data and run elastic net regression on the training portion to tune parameters and test it on the test portion. [Difficulty: **Intermediate**] 45 | 46 | 3. Run an ensemble model for regression using the **caretEnsemble** or **mlr** package and compare the results with the elastic net and random forest model. Did the test accuracy increase? 47 | **HINT:** You need to install these extra packages and learn how to use them in the context of ensemble models. [Difficulty: **Advanced**] -------------------------------------------------------------------------------- /chp06.md: -------------------------------------------------------------------------------- 1 | ## Exercises and solutions for Chapter 6 2 | 3 | The data for the exercises is within the `compGenomRData` package. 4 | 5 | Run the following to see the data files. 6 | ``` 7 | dir(system.file("extdata", 8 | package="compGenomRData")) 9 | ``` 10 | You will need some of those files to complete the exercises. 11 | 12 | ### Operations on genomic intervals with the `GenomicRanges` package 13 | 14 | 1. Create a `GRanges` object using the information in the table below:[Difficulty: **Beginner**] 15 | 16 | | chr | start | end |strand | score | 17 | | :--- |:------| :-----| :-----|:-----| 18 | | chr1 | 10000 | 10300 | + | 10 | 19 | | chr1 | 11100 | 11500 | - | 20 | 20 | | chr2 | 20000 | 20030 | + | 15 | 21 | 22 | 23 | **solution:** 24 | ```{r,echo=FALSE,eval=FALSE} 25 | library(GenomicRanges) 26 | 27 | gr <- GRanges(seqnames=c("chr1","chr1","chr2"), 28 | ranges=IRanges(start=c(10000,11100,20000), 29 | end=c(10300,11500,20030)), 30 | strand=c("+","-","+"), 31 | score=c(10,20,15) 32 | ) 33 | 34 | gr 35 | 36 | ``` 37 | 38 | 39 | 2. Use the `start()`, `end()`, `strand()`,`seqnames()` and `width()` functions on the `GRanges` 40 | object you created. Figure out what they are doing. 
Can you get a subset of the `GRanges` object for intervals that are only on the + strand? If you can do that, try getting intervals that are on chr1. *HINT:* `GRanges` objects can be subset using the `[ ]` operator, similar to data frames, but you may need 41 | to use `start()`, `end()` and `strand()`,`seqnames()` within the `[]`. [Difficulty: **Beginner/Intermediate**] 42 | 43 | **solution:** 44 | ```{r,echo=FALSE,eval=FALSE} 45 | # For each genomic interval, return... 46 | start(gr) # ...its start position 47 | end(gr) # ...its end position 48 | strand(gr) # ...its strand 49 | seqnames(gr) #...its sequence name 50 | width(gr) # its calculated interval from start to end (inclusive) 51 | 52 | gr[strand(gr) == "+"] # Retrieve intervals (rows) on + strand using logical operation 53 | gr[seqnames(gr) == "chr1"] # Retrieve intervals on chr1 using logical operation 54 | 55 | ``` 56 | 57 | 58 | 59 | 3. Import mouse (mm9 assembly) CpG islands and RefSeq transcripts for chr12 from the UCSC browser as `GRanges` objects using `rtracklayer` functions. HINT: Check chapter content and modify the code there as necessary. If that somehow does not work, go to the UCSC browser and download it as a BED file. The track name for Refseq genes is "RefSeq Genes" and the table name is "refGene". 
[Difficulty: **Beginner/Intermediate**] 60 | 61 | **solution:** 62 | ```{r,echo=FALSE,eval=FALSE} 63 | library(rtracklayer) # load necessary package 64 | # start a session at the nearest gateway 65 | session <- browserSession("UCSC",url = 'http://genome.ucsc.edu/cgi-bin/') 66 | genome(session) <- "mm9" # mouse genome assembly 67 | 68 | ## CpG islands on chr12 69 | CpG.query <- ucscTableQuery(session, 70 | track="CpG Islands", 71 | table="cpgIslandExt", 72 | range=GRangesForUCSCGenome("mm9", "chr12")) 73 | ## get the GRanges object 74 | CPG.gr <- track(CpG.query) # works OK 75 | CPG.gr # glimpse GRanges object 76 | 77 | ## RefSeq Genes on chr12 78 | RSG.query <- ucscTableQuery(session, 79 | # track="RefSeq Genes", # error for Unknown track: RefSeq Genes 80 | track="NCBI RefSeq", # apparently renamed 81 | table="refGene", 82 | range=GRangesForUCSCGenome("mm9", "chr12")) 83 | 84 | ## get the GRanges object 85 | # track(RSG.query) # Error in stop_if_wrong_length(what, ans_len) : 'ranges' must have the length of the object to construct (1598) or length 1 86 | # Due to error, instead retrieve table data frame 87 | RSG.df <- getTable(RSG.query) 88 | # And then convert data frame into GRanges object 89 | RSG.gr <- GRanges( 90 | seqnames = RSG.df$chrom, # alternative to [] to select columns from df 91 | ranges = IRanges(start=RSG.df$txStart, 92 | end=RSG.df$txEnd), 93 | strand = RSG.df$strand, 94 | name = RSG.df$name, 95 | name2 = RSG.df$name2, 96 | exonCount = RSG.df$exonCount) 97 | RSG.gr # glimpse GRanges object 98 | 99 | ``` 100 | 101 | 4. Following from the exercise above, get the promoters of Refseq transcripts (-1000bp and +1000 bp of the TSS) and calculate what percentage of them overlap with CpG islands. HINT: You have to get the promoter coordinates and use the `findOverlaps()` or `subsetByOverlaps()` from the `GenomicRanges` package. 
To get promoters, type `?promoters` on the R console and see how to use that function to get promoters or calculate their coordinates as shown in the chapter. [Difficulty: **Beginner/Intermediate**] 102 | 103 | 104 | **solution:** 105 | ```{r,echo=FALSE,eval=FALSE} 106 | prom.gr <- promoters(RSG.gr, upstream = 1000, downstream = 1000) 107 | prom.gr <- unique(prom.gr) # another way to remove duplicates 108 | n_prom <- length(prom.gr) # number of unique promoters 109 | n_prom 110 | 111 | prom_CpGi1 <- findOverlaps(prom.gr, CPG.gr, select = "first") # only select first overlap 112 | n_prom_CpGi1 <- length(na.omit(prom_CpGi1)) # remove NAs and count number of first overlaps with unique promoters 113 | n_prom_CpGi1 114 | 115 | # compute percentage of promoters that overlap a CpG island(s) 116 | perc_overlap <- n_prom_CpGi1/n_prom * 100 117 | perc_overlap 118 | 119 | ``` 120 | 121 | 5. Plot the distribution of CpG island lengths for CpG islands that overlap with the 122 | promoters. [Difficulty: **Beginner/Intermediate**] 123 | 124 | 125 | **solution:** 126 | ```{r,echo=FALSE,eval=FALSE} 127 | prom_CpGi <- findOverlaps(prom.gr, CPG.gr) # get ALL overlaps 128 | ol_CpGi_prom <- unique(prom_CpGi@to) # select 'to' vector with row numbers of CpG islands overlapping with promoters and then remove duplicate row numbers 129 | CpGi.lengths <- width(CPG.gr[ol_CpGi_prom]) # subset the CpGi GRanges with an overlap and get the width of each range 130 | hist(CpGi.lengths, xlab = "Lengths (bps)", breaks = 10) # make a histogram of lengths 131 | 132 | ``` 133 | 134 | 6. Get canonical peaks for SP1 (peaks that are in both replicates) on chr21. Peaks for each replicate are located in the `wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep1.broadPeak.gz` and `wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep2.broadPeak.gz` files. **HINT**: You need to use `findOverlaps()` or `subsetByOverlaps()` to get the subset of peaks that occur in both replicates (canonical peaks). 
You can try to read "...broadPeak.gz" files using the `genomation::readBroadPeak()` function; broadPeak is just an extended BED format. In addition, you can try to use the `coverage()` and `slice()` functions to get more precise canonical peak locations. [Difficulty: **Intermediate/Advanced**] 135 | 136 | 137 | **solution:** 138 | ```{r,echo=FALSE,eval=FALSE} 139 | #coming soon 140 | 141 | ``` 142 | 143 | ### Dealing with mapped high-throughput sequencing reads 144 | 145 | 1. Count the reads overlapping with canonical SP1 peaks using the BAM file for one of the replicates. The following file in the `compGenomRData` package contains the alignments for SP1 ChIP-seq reads: `wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam`. **HINT**: Use functions from the `GenomicAlignments` package. [Difficulty: **Beginner/Intermediate**] 146 | 147 | 148 | **solution:** 149 | ```{r,echo=FALSE,eval=FALSE} 150 | #coming soon 151 | 152 | ``` 153 | 154 | ### Dealing with contiguous scores over the genome 155 | 156 | 1. Extract the `Views` object for the promoters on chr20 from the `H1.ESC.H3K4me1.chr20.bw` file available in the `compGenomRData` package. Plot the first "View" as a line plot. **HINT**: See the code in the relevant section in the chapter and adapt the code from there.
[Difficulty: **Beginner/Intermediate**] 157 | 158 | **solution:** 159 | ```{r,echo=FALSE,eval=FALSE} 160 | # first get a file with the promoter locations on chr20 161 | transcriptFile <- system.file("extdata", 162 | "refseq.hg19.chr20.bed", 163 | package="compGenomRData") 164 | library(genomation) 165 | feat20.grl <- readTranscriptFeatures(transcriptFile, # see CompGenomR 6.1.2.1 166 | remove.unusual = TRUE) 167 | feat20.grl # glimpse GRangesList object with exons, introns, promoters, TSSes on chr20 168 | 169 | prom20.gr <- feat20.grl$promoters # get promoters from the features 170 | prom20.gr 171 | 172 | bwFile <- system.file("extdata", 173 | "H1.ESC.H3K4me3.chr20.bw", 174 | package="compGenomRData") 175 | # import the H3K4me3 data for the chr20 promoters as RleList object 176 | cov.bw <- import(bwFile, which=prom20.gr, as = "RleList") 177 | myViews <- Views(cov.bw, as(prom20.gr,"IRangesList")) # get subsets of coverage 178 | # there is a views object for each chromosome 179 | myViews 180 | 181 | # plot the coverage vector from the 1st view 182 | plot(myViews[[1]][[1]], type="l") 183 | 184 | ``` 185 | 186 | 2. Make a histogram of the maximum signal for the Views in the object you extracted above. You can use any of the view summary functions or use `lapply()` and write your own summary function. [Difficulty: **Beginner/Intermediate**] 187 | 188 | **solution:** 189 | ```{r,echo=FALSE,eval=FALSE} 190 | hist(viewMaxs(myViews[[1]]), xlab = "Maximum H3K4me3 signal") 191 | 192 | ``` 193 | 194 | 3. Get the genomic positions of maximum signal in each view and make a `GRanges` object. **HINT**: See the `?viewRangeMaxs` help page. Try to make a `GRanges` object out of the returned object. [Difficulty: **Intermediate**] 195 | 196 | **solution:** 197 | ```{r,echo=FALSE,eval=FALSE} 198 | #coming soon 199 | 200 | ``` 201 | 202 | ### Visualizing and summarizing genomic intervals 203 | 204 | 1. 
Extract -500,+500 bp regions around the TSSes on chr21; there are refseq files for the hg19 human genome assembly in the `compGenomRData` package. Use SP1 ChIP-seq data in the `compGenomRData` package, access the file path via the `system.file()` function, the file name is: 205 | `wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam`. Create an average profile of read coverage around the TSSes. Following that, visualize the read coverage with a heatmap. **HINT**: All of these are possible using the `genomation` package functions. Check `help(ScoreMatrix)` to see how you can use bam files. As an example, here is how you can get the file path to the refseq annotation on chr21. [Difficulty: **Intermediate/Advanced**] 206 | ```{r example,eval=FALSE} 207 | transcriptFilechr21=system.file("extdata", 208 | "refseq.hg19.chr21.bed", 209 | package="compGenomRData") 210 | ``` 211 | 212 | **solution:** 213 | ```{r,echo=FALSE,eval=FALSE} 214 | #coming soon 215 | 216 | ``` 217 | 218 | 2. Extract -500,+500 bp regions around the TSSes on chr20. Use H3K4me3 (`H1.ESC.H3K4me3.chr20.bw`) and H3K27ac (`H1.ESC.H3K27ac.chr20.bw`) ChIP-seq enrichment data in the `compGenomRData` package and create heatmaps and average signal profiles for regions around the TSSes. [Difficulty: **Intermediate/Advanced**] 219 | 220 | **solution:** 221 | ```{r,echo=FALSE,eval=FALSE} 222 | #coming soon 223 | 224 | ``` 225 | 226 | 3. Download P300 ChIP-seq peaks data from the UCSC browser. The peaks are locations where P300 binds. The P300 binding marks enhancer regions in the genome. (**HINT**: group: "regulation", track: "Txn Factor ChIP", table:"wgEncodeRegTfbsClusteredV3", you need to filter the rows for "EP300" name.) Check enrichment of H3K4me3, H3K27ac and DNase-seq (`H1.ESC.dnase.chr20.bw`) experiments on chr20 on and around the P300 binding sites, using data from the `compGenomRData` package. Make multi-heatmaps and metaplots. What is different from the TSS profiles?
[Difficulty: **Advanced**] 227 | 228 | **solution:** 229 | ```{r,echo=FALSE,eval=FALSE} 230 | #coming soon 231 | 232 | ``` 233 | 234 | 4. Cluster the rows of the multi-heatmaps from the task above. Are there obvious clusters? **HINT**: Check the arguments of the `multiHeatMatrix()` function. [Difficulty: **Advanced**] 235 | 236 | **solution:** 237 | ```{r,echo=FALSE,eval=FALSE} 238 | #coming soon 239 | 240 | ``` 241 | 242 | 243 | 5. Visualize one of the -500,+500 bp regions around the TSS using `Gviz` functions. You should visualize both the H3K4me3 and H3K27ac signals and the gene models. [Difficulty: **Advanced**] 244 | 245 | **solution:** 246 | ```{r,echo=FALSE,eval=FALSE} 247 | #coming soon 248 | 249 | ``` 250 | 251 | -------------------------------------------------------------------------------- /chp07.md: -------------------------------------------------------------------------------- 1 | ## Exercises and solutions for Chapter 7 2 | 3 | For this set of exercises, we will use the `chip_1_1.fq.bz2` and `chip_2_1.fq.bz2` files from the `QuasR` package. You can reach the folder that contains the files as follows: 4 | ```{r seqProcessEx,eval=FALSE} 5 | folder=(system.file(package="QuasR", "extdata")) 6 | dir(folder) # will show the contents of the folder 7 | ``` 8 | 1. Plot the base quality distributions of the ChIP-seq samples using the `Rqc` package. 9 | **HINT**: You need to provide a regular expression pattern for extracting the right files from the folder. `"^chip"` matches the files beginning with "chip". [Difficulty: **Beginner/Intermediate**] 10 | 11 | 12 | **solution:** 13 | ```{r,echo=FALSE,eval=FALSE} 14 | 15 | folder = system.file(package="QuasR", "extdata") 16 | library(Rqc) 17 | 18 | 19 | # feed the fastq.bz2 files in "folder" to the quality check function 20 | qcRes=rqc(path = folder, pattern = "^chip", openBrowser=FALSE) 21 | rqcCycleQualityBoxPlot(qcRes) 22 | 23 | ``` 24 | 25 | 2. 
Now we will trim the reads based on the quality scores. Let's trim 2-4 bases on the 3' end depending on the quality scores. You can use the `QuasR::preprocessReads()` function for this purpose. [Difficulty: **Beginner/Intermediate**] 26 | 27 | **solution:** 28 | ```{r,echo=FALSE,eval=FALSE} 29 | #coming soon 30 | 31 | ``` 32 | 33 | 3. Align the trimmed and untrimmed reads using `QuasR` and plot the alignment statistics. Did the trimming improve the alignments? [Difficulty: **Intermediate/Advanced**] 34 | 35 | **solution:** 36 | ```{r,echo=FALSE,eval=FALSE} 37 | #coming soon 38 | 39 | ``` 40 | 41 | -------------------------------------------------------------------------------- /chp08.md: -------------------------------------------------------------------------------- 1 | ## Exercises and solutions for Chapter 8 2 | 3 | ### Exploring the count tables 4 | 5 | Here, import an example count table and do some exploration of the expression data. 6 | 7 | ```{r exSetup1, eval=FALSE} 8 | counts_file <- system.file("extdata/rna-seq/SRP029880.raw_counts.tsv", 9 | package = "compGenomRData") 10 | coldata_file <- system.file("extdata/rna-seq/SRP029880.colData.tsv", 11 | package = "compGenomRData") 12 | ``` 13 | 14 | 1. Normalize the counts using the TPM approach. [Difficulty: **Beginner**] 15 | 16 | **solution:** 17 | ```{r,echo=FALSE,eval=FALSE} 18 | #coming soon 19 | 20 | ``` 21 | 22 | 2. Plot a heatmap of the top 500 most variable genes. Compare with the heatmap obtained using the 100 most variable genes. [Difficulty: **Beginner**] 23 | 24 | **solution:** 25 | ```{r,echo=FALSE,eval=FALSE} 26 | #coming soon 27 | 28 | ``` 29 | 30 | 3. Re-do the heatmaps, setting the `scale` argument to `none` and to `column`. Compare the results with `scale = 'row'`. [Difficulty: **Beginner**] 31 | 32 | **solution:** 33 | ```{r,echo=FALSE,eval=FALSE} 34 | #coming soon 35 | 36 | ``` 37 | 38 | 4. 
Draw a correlation plot for the samples depicting the sample differences as 'ellipses', drawing only the upper triangle of the matrix, and order the samples by hierarchical clustering using the `average` linkage method. [Difficulty: **Beginner**] 39 | 40 | **solution:** 41 | ```{r,echo=FALSE,eval=FALSE} 42 | #coming soon 43 | 44 | ``` 45 | 46 | 5. How else could the count matrix be subsetted to obtain quick and accurate clusters? Try selecting the top 100 genes that have the highest total expression in all samples and re-draw the cluster heatmaps and PCA plots. [Difficulty: **Intermediate**] 47 | 48 | **solution:** 49 | ```{r,echo=FALSE,eval=FALSE} 50 | #coming soon 51 | 52 | ``` 53 | 54 | 6. Add an additional column to the annotation data.frame object to annotate the samples and use the updated annotation data.frame to plot the heatmaps. (Hint: Assign different batch values to CASE and CTRL samples). Make a PCA plot and color samples by the added variable (e.g. batch). [Difficulty: **Intermediate**] 55 | 56 | **solution:** 57 | ```{r,echo=FALSE,eval=FALSE} 58 | #coming soon 59 | 60 | ``` 61 | 62 | 7. Try making the heatmaps using all the genes in the count table, rather than sub-selecting. [Difficulty: **Advanced**] 63 | 64 | **solution:** 65 | ```{r,echo=FALSE,eval=FALSE} 66 | #coming soon 67 | 68 | ``` 69 | 70 | 8. Use the [`Rtsne` package](https://cran.r-project.org/web/packages/Rtsne/Rtsne.pdf) to draw a t-SNE plot of the expression values. Color the points by sample group. Compare the results with the PCA plots. [Difficulty: **Advanced**] 71 | 72 | **solution:** 73 | ```{r,echo=FALSE,eval=FALSE} 74 | #coming soon 75 | 76 | ``` 77 | 78 | 79 | ### Differential expression analysis 80 | 81 | Firstly, carry out a differential expression analysis starting from raw counts. 
82 | Use the following datasets: 83 | 84 | ``` 85 | counts_file <- system.file("extdata/rna-seq/SRP029880.raw_counts.tsv", 86 | package = "compGenomRData") 87 | coldata_file <- system.file("extdata/rna-seq/SRP029880.colData.tsv", 88 | package = "compGenomRData") 89 | ``` 90 | 91 | - Import the read counts and colData tables. 92 | - Set up a DESeqDataSet object. 93 | - Filter out genes with low counts. 94 | - Run DESeq2 contrasting the `CASE` sample with `CONTROL` samples. 95 | 96 | Now, you are ready to do the following exercises: 97 | 98 | 1. Make a volcano plot using the differential expression analysis results. (Hint: x-axis denotes the log2FoldChange and the y-axis represents the -log10(pvalue)). [Difficulty: **Beginner**] 99 | 100 | **solution:** 101 | ```{r,echo=FALSE,eval=FALSE} 102 | #coming soon 103 | 104 | ``` 105 | 106 | 2. Use DESeq2::plotDispEsts to make a dispersion plot and find out the meaning of this plot. (Hint: Type ?DESeq2::plotDispEsts) [Difficulty: **Beginner**] 107 | 108 | **solution:** 109 | ```{r,echo=FALSE,eval=FALSE} 110 | #coming soon 111 | 112 | ``` 113 | 114 | 3. Explore `lfcThreshold` argument of the `DESeq2::results` function. What is its default value? What does it mean to change the default value to, for instance, `1`? [Difficulty: **Intermediate**] 115 | 116 | **solution:** 117 | ```{r,echo=FALSE,eval=FALSE} 118 | #coming soon 119 | 120 | ``` 121 | 122 | 4. What is independent filtering? What happens if we don't use it? Google `independent filtering statquest` and watch the online video about independent filtering. [Difficulty: **Intermediate**] 123 | 124 | **solution:** 125 | ```{r,echo=FALSE,eval=FALSE} 126 | #coming soon 127 | 128 | ``` 129 | 130 | 5. Re-do the differential expression analysis using the `edgeR` package. Find out how much DESeq2 and edgeR agree on the list of differentially expressed genes. 
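**HINT**: A possible `edgeR` sketch is below (untested). It assumes `counts` and `colData` are the objects imported in the setup above, that `colData` has a `group` column, and that the DESeq2 results were saved in a hypothetical `res.deseq` object.

```{r, eval=FALSE}
library(edgeR)

dge <- DGEList(counts = counts, group = colData$group)
dge <- dge[filterByExpr(dge), , keep.lib.sizes = FALSE] # drop low-count genes
dge <- calcNormFactors(dge)

design <- model.matrix(~ group, data = colData)
dge <- estimateDisp(dge, design)
fit <- glmQLFit(dge, design)
res.edger <- topTags(glmQLFTest(fit), n = Inf)$table

# overlap with the DESeq2 hits (res.deseq is assumed to exist)
deseq.hits <- rownames(res.deseq)[which(res.deseq$padj < 0.05)]
edger.hits <- rownames(res.edger)[res.edger$FDR < 0.05]
length(intersect(deseq.hits, edger.hits))
```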
[Difficulty: **Advanced**] 131 | 132 | **solution:** 133 | ```{r,echo=FALSE,eval=FALSE} 134 | #coming soon 135 | 136 | ``` 137 | 138 | 6. Use the `compcodeR` package to run the differential expression analysis using at least three different tools and compare and contrast the results following the `compcodeR` vignette. [Difficulty: **Advanced**] 139 | 140 | **solution:** 141 | ```{r,echo=FALSE,eval=FALSE} 142 | #coming soon 143 | 144 | ``` 145 | 146 | 147 | ### Functional enrichment analysis 148 | 149 | 1. Re-run gProfileR, this time using pathway annotations such as KEGG, REACTOME, and protein complex databases such as CORUM, in addition to the GO terms. Sort the resulting tables by columns `precision` and/or `recall`. How do the top GO terms change when sorted for `precision`, `recall`, or `p.value`? [Difficulty: **Beginner**] 150 | 151 | **solution:** 152 | ```{r,echo=FALSE,eval=FALSE} 153 | #coming soon 154 | 155 | ``` 156 | 157 | 2. Repeat the gene set enrichment analysis by trying different options for the `compare` argument of the `gage::gage` 158 | function. How do the results differ? [Difficulty: **Beginner**] 159 | 160 | **solution:** 161 | ```{r,echo=FALSE,eval=FALSE} 162 | #coming soon 163 | 164 | ``` 165 | 166 | 3. Make a scatter plot of GO term sizes and obtained p-values by setting the `gProfileR::gprofiler` argument `significant = FALSE`. Is there a correlation of term sizes and p-values? (Hint: Take -log10 of p-values). If so, how can this bias be mitigated? [Difficulty: **Intermediate**] 167 | 168 | **solution:** 169 | ```{r,echo=FALSE,eval=FALSE} 170 | #coming soon 171 | 172 | ``` 173 | 174 | 4. Do a gene-set enrichment analysis using gene sets from the top 10 GO terms. [Difficulty: **Intermediate**] 175 | 176 | **solution:** 177 | ```{r,echo=FALSE,eval=FALSE} 178 | #coming soon 179 | 180 | ``` 181 | 182 | 5. What are the other available R packages that can carry out gene set enrichment analysis for RNA-seq datasets? 
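**HINT**: Besides gProfileR and gage, packages such as `fgsea`, `clusterProfiler`, `GSVA`, and `limma` (its `camera()` and `roast()` functions) can carry out gene set enrichment analysis. As an untested sketch of one of them, assuming `ranks` is a named vector of gene-level statistics (e.g. DESeq2 `stat` values named by gene ID) and `pathways` is a list of gene sets:

```{r, eval=FALSE}
library(fgsea)
# `ranks` and `pathways` are assumed to exist; see the fgsea vignette
fgseaRes <- fgsea(pathways = pathways, stats = ranks)
head(fgseaRes[order(fgseaRes$padj), ])
```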
[Difficulty: **Intermediate**] 183 | 184 | **solution:** 185 | ```{r,echo=FALSE,eval=FALSE} 186 | #coming soon 187 | 188 | ``` 189 | 190 | 6. Use the topGO package (https://bioconductor.org/packages/release/bioc/html/topGO.html) to re-do the GO term analysis. Compare and contrast the results with what has been obtained using the `gProfileR` package. Which tool is faster, `gProfileR` or topGO? Why? [Difficulty: **Advanced**] 191 | 192 | **solution:** 193 | ```{r,echo=FALSE,eval=FALSE} 194 | #coming soon 195 | 196 | ``` 197 | 198 | 7. Given a gene set annotated for human, how can it be utilized to work on _C. elegans_ data? (Hint: See `biomaRt::getLDS`). [Difficulty: **Advanced**] 199 | 200 | **solution:** 201 | ```{r,echo=FALSE,eval=FALSE} 202 | #coming soon 203 | 204 | ``` 205 | 206 | 8. Import curated pathway gene sets with Entrez identifiers from the [MSIGDB database](http://software.broadinstitute.org/gsea/msigdb/collections.jsp) and re-do the GSEA for all curated gene sets. [Difficulty: **Advanced**] 207 | 208 | **solution:** 209 | ```{r,echo=FALSE,eval=FALSE} 210 | #coming soon 211 | 212 | ``` 213 | 214 | 215 | ### Removing unwanted variation from the expression data 216 | 217 | For the exercises below, use the datasets at: 218 | ``` 219 | counts_file <- system.file('extdata/rna-seq/SRP049988.raw_counts.tsv', 220 | package = 'compGenomRData') 221 | colData_file <- system.file('extdata/rna-seq/SRP049988.colData.tsv', 222 | package = 'compGenomRData') 223 | ``` 224 | 225 | 1. Run RUVSeq using multiple values of `k` from 1 to 10 and compare and contrast the PCA plots obtained from the normalized counts of each RUVSeq run. [Difficulty: **Beginner**] 226 | 227 | **solution:** 228 | ```{r,echo=FALSE,eval=FALSE} 229 | #coming soon 230 | 231 | ``` 232 | 233 | 2. Re-run RUVSeq using the `RUVr()` function. Compare PCA plots from `RUVs`, `RUVg` and `RUVr` using the same `k` values and find out which one performs the best. 
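**HINT**: `RUVr()` needs residuals from a first-pass GLM fit. A possible sketch, untested and assuming `set` is the `SeqExpressionSet` built earlier and `groups` holds the sample groups:

```{r, eval=FALSE}
library(RUVSeq)
library(edgeR)

# first-pass edgeR fit to obtain deviance residuals
design <- model.matrix(~ groups)
dge <- DGEList(counts = counts(set), group = groups)
dge <- calcNormFactors(dge, method = "upperquartile")
dge <- estimateGLMCommonDisp(dge, design)
dge <- estimateGLMTagwiseDisp(dge, design)
fit <- glmFit(dge, design)
res <- residuals(fit, type = "deviance")

# run RUVr for several k values and compare the PCA plots
for (k in 1:4) {
  set.r <- RUVr(set, rownames(set), k = k, residuals = res)
  EDASeq::plotPCA(set.r, col = as.numeric(as.factor(groups)),
                  main = paste0("RUVr, k = ", k))
}
```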
[Difficulty: **Intermediate**] 234 | 235 | **solution:** 236 | ```{r,echo=FALSE,eval=FALSE} 237 | #coming soon 238 | 239 | ``` 240 | 241 | 3. Do the necessary diagnostic plots using the differential expression results from the EHF count table. [Difficulty: **Intermediate**] 242 | 243 | **solution:** 244 | ```{r,echo=FALSE,eval=FALSE} 245 | #coming soon 246 | 247 | ``` 248 | 249 | 4. Use the `sva` package to discover sources of unwanted variation, re-do the differential expression analysis using variables from the output of `sva`, and compare the results with the `DESeq2` results obtained using the `RUVSeq`-corrected normalized counts. [Difficulty: **Advanced**] 250 | 251 | **solution:** 252 | ```{r,echo=FALSE,eval=FALSE} 253 | #coming soon 254 | 255 | ``` 256 | 257 | 258 | -------------------------------------------------------------------------------- /chp09.md: -------------------------------------------------------------------------------- 1 | ## Exercises and solutions for Chapter 9 2 | 3 | ### Quality control 4 | 5 | 1. Apply the fragment size estimation procedure to all available ChIP and Input datasets. [Difficulty: **Beginner**] 6 | 7 | **solution:** 8 | ```{r,echo=FALSE,eval=FALSE} 9 | #coming soon 10 | 11 | ``` 12 | 13 | 2. Visualize the resulting distributions. [Difficulty: **Beginner**] 14 | 15 | **solution:** 16 | ```{r,echo=FALSE,eval=FALSE} 17 | #coming soon 18 | 19 | ``` 20 | 21 | 3. How does the Input sample distribution differ from the ChIP samples? [Difficulty: **Beginner**] 22 | 23 | **solution:** 24 | ```{r,echo=FALSE,eval=FALSE} 25 | #coming soon 26 | 27 | ``` 28 | 29 | 4. Write a function which converts the bam files into bigWig files. [Difficulty: **Beginner**] 30 | 31 | **solution:** 32 | ```{r,echo=FALSE,eval=FALSE} 33 | #coming soon 34 | 35 | ``` 36 | 37 | 5. Apply the function to all files, and visualize them in the genome browser. 38 | Observe the signal profiles. What do you notice about the similarity of the samples? 
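**HINT**: A converter for the previous task could look like the untested sketch below; the bam file names are placeholders, and the fragment-size extension of 200 bp is an arbitrary assumption.

```{r, eval=FALSE}
library(GenomicAlignments)
library(rtracklayer)

bamToBigWig <- function(bam, bw, extend = 200) {
    reads <- granges(readGAlignments(bam))
    reads <- resize(reads, width = extend) # extend reads to fragment size
    export.bw(coverage(reads), bw)         # write the coverage track
}
bamToBigWig("CTCF_rep1.bam", "CTCF_rep1.bw") # placeholder file names
```

The resulting bigWig files can then be loaded as custom tracks in the UCSC genome browser or IGV.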
[Difficulty: **Beginner**] 39 | 40 | **solution:** 41 | ```{r,echo=FALSE,eval=FALSE} 42 | #coming soon 43 | 44 | ``` 45 | 46 | 6. Use `Gviz` to visualize the profiles for CTCF, SMC3 and ZNF143. [Difficulty: **Beginner/Intermediate**] 47 | 48 | **solution:** 49 | ```{r,echo=FALSE,eval=FALSE} 50 | #coming soon 51 | 52 | ``` 53 | 54 | 7. Calculate the cross-correlation for both CTCF replicates and 55 | the input samples. How does the profile look for the control samples? [Difficulty: **Intermediate**] 56 | 57 | **solution:** 58 | ```{r,echo=FALSE,eval=FALSE} 59 | #coming soon 60 | 61 | ``` 62 | 63 | 8. Calculate the cross-correlation coefficients for all samples and 64 | visualize them as a heatmap. [Difficulty: **Intermediate**] 65 | 66 | **solution:** 67 | ```{r,echo=FALSE,eval=FALSE} 68 | #coming soon 69 | 70 | ``` 71 | 72 | #### Peak calling 73 | 74 | 1. Use `normR` to call peaks for all SMC3, CTCF, and ZNF143 samples. [Difficulty: **Beginner**] 75 | 76 | **solution:** 77 | ```{r,echo=FALSE,eval=FALSE} 78 | #coming soon 79 | 80 | ``` 81 | 82 | 2. Calculate the percentage of reads in peaks for the CTCF experiment. [Difficulty: **Intermediate**] 83 | 84 | **solution:** 85 | ```{r,echo=FALSE,eval=FALSE} 86 | #coming soon 87 | 88 | ``` 89 | 90 | 3. Download the blacklisted regions corresponding to the hg38 human genome, and calculate 91 | the percentage of CTCF peaks falling in such regions. [Difficulty: **Advanced**] 92 | 93 | **solution:** 94 | ```{r,echo=FALSE,eval=FALSE} 95 | #coming soon 96 | 97 | ``` 98 | 99 | 4. Unify the biological replicates by taking an intersection of peaks. 100 | How many peaks are specific to each biological replicate, and how many peaks overlap? [Difficulty: **Intermediate**] 101 | 102 | **solution:** 103 | ```{r,echo=FALSE,eval=FALSE} 104 | #coming soon 105 | 106 | ``` 107 | 108 | 5. Plot a scatter plot of signal strengths for biological replicates. Do intersecting 109 | peaks have equal signal strength in both samples? 
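**HINT**: One way to sketch this (untested; `peaks` is assumed to be the `GRanges` of intersected replicate peaks, and the bam file names are placeholders):

```{r, eval=FALSE}
library(GenomicAlignments)

# count reads from each replicate falling into the shared peaks
cnt1 <- countOverlaps(peaks, granges(readGAlignments("CTCF_rep1.bam")))
cnt2 <- countOverlaps(peaks, granges(readGAlignments("CTCF_rep2.bam")))
plot(log10(cnt1 + 1), log10(cnt2 + 1), pch = 19, cex = .3,
     xlab = "CTCF rep1, log10(reads in peak)",
     ylab = "CTCF rep2, log10(reads in peak)")
```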
[Difficulty: **Intermediate**] 110 | 111 | **solution:** 112 | ```{r,echo=FALSE,eval=FALSE} 113 | #coming soon 114 | 115 | ``` 116 | 117 | 6. Quantify the combinatorial binding of all three proteins. Find 118 | the number of places which are bound by all three proteins, by 119 | a combination of two proteins, and exclusively by one protein. 120 | Annotate the different regions based on their genomic location. [Difficulty: **Advanced**] 121 | 122 | **solution:** 123 | ```{r,echo=FALSE,eval=FALSE} 124 | #coming soon 125 | 126 | ``` 127 | 128 | 7. Correlate the normR enrichment score for CTCF with peak presence/absence 129 | (create boxplots of enrichment for peaks which contain and do not contain CTCF motifs). [Difficulty: **Advanced**] 130 | 131 | **solution:** 132 | ```{r,echo=FALSE,eval=FALSE} 133 | #coming soon 134 | 135 | ``` 136 | 137 | 8. Explore the co-localization of CTCF and ZNF143. Where are the co-bound 138 | regions located? Which sequence motifs do they contain? Download the ChIA-PET 139 | data for the GM12878 cell line, and look at the 3D interaction between different 140 | classes of binding sites. [Difficulty: **Advanced**] 141 | 142 | **solution:** 143 | ```{r,echo=FALSE,eval=FALSE} 144 | #coming soon 145 | 146 | ``` 147 | 148 | #### Motif discovery 149 | 150 | 1. Repeat the motif discovery analysis on peaks from the ZNF143 transcription factor. 151 | How many motifs do you observe? How do the motifs look (visualize the motif logos)? [Difficulty: **Intermediate**] 152 | 153 | **solution:** 154 | ```{r,echo=FALSE,eval=FALSE} 155 | #coming soon 156 | 157 | ``` 158 | 159 | 2. Scan the ZNF143 peaks with the top motifs found in the previous exercise. 160 | Where are the motifs located? [Difficulty: **Advanced**] 161 | 162 | **solution:** 163 | ```{r,echo=FALSE,eval=FALSE} 164 | #coming soon 165 | 166 | ``` 167 | 168 | 3. Scan the CTCF peaks with the top motifs identified in the **ZNF143** peaks. 169 | Where are the motifs located? 
What can you conclude from the previous exercises? 170 | [Difficulty: **Advanced**] 171 | 172 | **solution:** 173 | ```{r,echo=FALSE,eval=FALSE} 174 | #coming soon 175 | 176 | ``` 177 | 178 | 179 | 180 | 181 | 182 | -------------------------------------------------------------------------------- /chp10.md: -------------------------------------------------------------------------------- 1 | ## Exercises 2 | 3 | ### Differential methylation 4 | The main objective of this exercise is to find differentially methylated cytosines between two groups of samples: IDH-mut (AML patients with IDH mutations) vs. NBM (normal bone marrow samples). 5 | 6 | 1. Download methylation call files from GEO. These files are readable by methylKit using default `methRead` arguments. [Difficulty: **Beginner**] 7 | 8 | samples Link 9 | ------- ------ 10 | IDH1_rep1 [link](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM919990&format=file&file=GSM919990%5FIDH%2Dmut%5F1%5FmyCpG%2Etxt%2Egz) 11 | IDH1_rep2 [link](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM919991&format=file&file=GSM919991%5FIDH%5Fmut%5F2%5FmyCpG%2Etxt%2Egz) 12 | NBM_rep1 [link](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM919982&format=file&file=GSM919982%5FNBM%5F1%5FmyCpG%2Etxt%2Egz) 13 | NBM_rep2 [link](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM919984&format=file&file=GSM919984%5FNBM%5F2%5FRep1%5FmyCpG%2Etxt%2Egz) 14 | 15 | Example code for reading a file: 16 | ```{r exaReadMethExercise, eval=FALSE} 17 | library(methylKit) 18 | m=methRead("~/Downloads/GSM919982_NBM_1_myCpG.txt.gz", 19 | sample.id = "nbm1",assembly="hg18") 20 | ``` 21 | 22 | 23 | **solution:** 24 | ```{r,echo=FALSE,eval=FALSE} 25 | #coming soon 26 | 27 | ``` 28 | 29 | 30 | 2. Find differentially methylated cytosines. Use chr1 and chr2 only if you need to save time. You can subset the data after you download the files, either in R or in Unix. The files are for the hg18 assembly of the human genome. 
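**HINT**: A possible `methylKit` sketch is below (untested; the file paths are placeholders for the four downloaded call files):

```{r, eval=FALSE}
library(methylKit)

files <- list("IDH1_rep1.txt.gz", "IDH1_rep2.txt.gz",  # placeholder paths
              "NBM_rep1.txt.gz",  "NBM_rep2.txt.gz")
obj <- methRead(files,
                sample.id = list("idh1", "idh2", "nbm1", "nbm2"),
                assembly = "hg18",
                treatment = c(1, 1, 0, 0))   # 1 = IDH-mut, 0 = NBM

# filter extreme coverage, normalize, and merge samples over common CpGs
obj  <- filterByCoverage(obj, lo.count = 10, hi.perc = 99.9)
meth <- unite(normalizeCoverage(obj))

dmc <- calculateDiffMeth(meth)
myDiff <- getMethylDiff(dmc, difference = 25, qvalue = 0.01)
```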
[Difficulty: **Beginner**] 31 | 32 | **solution:** 33 | ```{r,echo=FALSE,eval=FALSE} 34 | #coming soon 35 | 36 | ``` 37 | 38 | 3. Describe the general differential methylation trend. What is the main effect for most CpGs? [Difficulty: **Intermediate**] 39 | 40 | **solution:** 41 | ```{r,echo=FALSE,eval=FALSE} 42 | #coming soon 43 | 44 | ``` 45 | 46 | 4. Annotate differentially methylated cytosines (DMCs) as promoter/intron/exon. [Difficulty: **Beginner**] 47 | 48 | **solution:** 49 | ```{r,echo=FALSE,eval=FALSE} 50 | #coming soon 51 | 52 | ``` 53 | 54 | 5. Which genes are the nearest to DMCs? [Difficulty: **Intermediate**] 55 | 56 | **solution:** 57 | ```{r,echo=FALSE,eval=FALSE} 58 | #coming soon 59 | 60 | ``` 61 | 62 | 6. Can you do gene set analysis either in R or via web-based tools? [Difficulty: **Advanced**] 63 | 64 | **solution:** 65 | ```{r,echo=FALSE,eval=FALSE} 66 | #coming soon 67 | 68 | ``` 69 | 70 | 71 | 72 | 73 | ### Methylome segmentation 74 | The main objective of this exercise is to learn how to do methylome segmentation and the downstream analysis for annotation and data integration. 75 | 76 | 1. Download the human embryonic stem-cell (H1 Cell Line) methylation bigWig files from the [Roadmap Epigenomics website](http://egg2.wustl.edu/roadmap/web_portal/processed_data.html#MethylData). It may take a while to understand how the website is structured and which bigWig file to use. That is part of the exercise. The files you will download are for the hg19 assembly unless stated otherwise. [Difficulty: **Beginner**] 77 | 78 | **solution:** 79 | ```{r,echo=FALSE,eval=FALSE} 80 | #coming soon 81 | 82 | ``` 83 | 84 | 2. Do segmentation on the hESC methylome. You can use only chr1 if using the whole genome takes too much time. [Difficulty: **Intermediate**] 85 | 86 | **solution:** 87 | ```{r,echo=FALSE,eval=FALSE} 88 | #coming soon 89 | 90 | ``` 91 | 92 | 3. 
Annotate the segments and determine the kinds of gene-based features each segment class overlaps with (promoter/exon/intron). [Difficulty: **Beginner**] 93 | 94 | **solution:** 95 | ```{r,echo=FALSE,eval=FALSE} 96 | #coming soon 97 | 98 | ``` 99 | 100 | 4. For each segment type, annotate the segments with chromHMM annotations from the Roadmap Epigenome database available [here](https://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#core_15state). The specific file you should use is [here](https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E003_15_coreMarks_mnemonics.bed.gz). This is a bed file with chromHMM annotations. chromHMM annotations are parts of the genome identified by a hidden-Markov-model-based machine learning algorithm. The segments correspond to active promoters, enhancers, active transcription, insulators, etc. The chromHMM model uses histone modification ChIP-seq and potentially other ChIP-seq data sets to annotate the genome. [Difficulty: **Advanced**] 101 | 102 | **solution:** 103 | ```{r,echo=FALSE,eval=FALSE} 104 | #coming soon 105 | 106 | ``` 107 | 108 | -------------------------------------------------------------------------------- /chp11.md: -------------------------------------------------------------------------------- 1 | ## Exercises and solutions for Chapter 11 2 | 3 | ### Matrix factorization methods 4 | 5 | 1. Find features associated with iCluster and MFA factors, and visualize the feature weights. [Difficulty: **Beginner**] 6 | 7 | **solution:** 8 | 9 | iClusterPlus stores feature LASSO weights in the `beta` element of the result. Recall we saved the results of running `iClusterPlus` in the `r.icluster` variable. The data type is a list, with one element per data type, i.e. `r.icluster$beta[[1]]` will house the coefficients for the first data type, etc. 10 | 11 | FactoMineR stores the MFA coefficients analogously to NMF; they can be extracted with `mfa.w <- r.mfa$quanti.var$coord`. 
Visualization can follow the same logic as the NMF example in the text. 12 | 13 | 2. Normalizing the data matrices by their $\lambda_1$'s as in MFA assumes we wish to assign each data type the same importance in the downstream analysis. This leads to a natural generalization whereby the different data types may be differently weighted. Provide an implementation of weighted MFA where the different data types may be assigned individual weights. [Difficulty: **Intermediate**] 14 | 15 | **solution:** 16 | 17 | This can be achieved by using R's `prcomp` and foregoing the `FactoMineR` package. For example: 18 | 19 | ```{r} 20 | big_x <- rbind(x1 / factor1, x2 / factor2, x3 / factor3) 21 | result <- prcomp(big_x) 22 | ``` 23 | 24 | where `factor1`, `factor2`, and `factor3` are arbitrary scaling factors chosen for each input modality. 25 | 26 | 3. In order to use NMF algorithms on data which can be negative, we need to split each feature into two new features, one positive and one negative. Implement the following function, and see that the included test does not fail: [Difficulty: **Intermediate/Advanced**] 27 | 28 | ```{r,moNMFExerciseColumnSplitting,eval=FALSE, echo=TRUE} 29 | # Implement this function 30 | split_neg_columns <- function(x) { 31 | # your code here 32 | } 33 | 34 | # a test that shows the function above works 35 | test_split_neg_columns <- function() { 36 | input <- as.data.frame(cbind(c(1,2,1),c(0,1,-2))) 37 | output <- as.data.frame(cbind(c(1,2,1), c(0,0,0), c(0,1,0), c(0,0,2))) 38 | stopifnot(all(output == split_neg_columns(input))) 39 | } 40 | 41 | # run the test to verify your solution 42 | test_split_neg_columns() 43 | ``` 44 | 45 | 46 | **solution:** 47 | ```{r,echo=FALSE,eval=FALSE} 48 | # For NMF to be usable on data types which allow negativity, we'll use a trick, 49 | # where we split each column into two non-negative columns, one for the negative 50 | # numbers and one for the positive numbers. 
Here's a function that does that: 51 | split_neg_columns <- function(x) { 52 | new_cols <- list() 53 | for(i in seq_len(dim(x)[2])) { 54 | new_cols[[paste0(colnames(x)[i],'+')]] <- sapply(X = x[,i], 55 | function(x) max(0,x)) 56 | new_cols[[paste0(colnames(x)[i],'-')]] <- sapply(X = -x[,i], 57 | function(x) max(0,x)) 58 | } 59 | 60 | return(do.call(cbind, new_cols)) 61 | } 62 | 63 | # and here's a test that shows the function above works 64 | test_split_neg_columns <- function() { 65 | test.input <- as.data.frame(cbind(c(1,2,1),c(0,1,-2))) 66 | expected.output <- as.data.frame(cbind(c(1,2,1), 67 | c(0,0,0), 68 | c(0,1,0), 69 | c(0,0,2))) 70 | stopifnot(all(as.matrix(expected.output) == as.matrix(split_neg_columns( 71 | test.input)))) 72 | } 73 | test_split_neg_columns() 74 | 75 | # Here's a function that undoes the previous transformation, 76 | # so it takes each pair of columns and merges them to one, where the values 77 | # from the first column are taken as positives, and the values from the second 78 | # column as negatives. 79 | merge_neg_columns <- function(x) { 80 | new_cols <- list() 81 | for(i in seq_len(dim(x)[2]/2)) { 82 | pos_col <- x[,2*i-1] 83 | neg_col <- x[,2*i] 84 | merged_col <- pos_col 85 | merged_col[neg_col>0] <- -neg_col[neg_col>0] 86 | new_cols[[i]] <- merged_col 87 | } 88 | return(do.call(cbind, new_cols)) 89 | } 90 | 91 | # And here's a test for the merging function. 92 | test_merge_neg_columns <- function() { 93 | input1 <- as.data.frame(cbind(c(1,2,1),c(0,1,-2))) 94 | input2 <- split_neg_columns(input1) 95 | output <- merge_neg_columns(input2) 96 | stopifnot(all(output == input1)) 97 | } 98 | test_merge_neg_columns() 99 | 100 | ``` 101 | 102 | 103 | 4. The iCluster+ algorithm has some parameters which may be tuned for maximum performance. The `iClusterPlus` package has a method, `iClusterPlus::tune.iClusterPlus`, which does this automatically based on the Bayesian Information Criterion (BIC). 
Run this method on the data from the examples above and find the optimal lambda and alpha values. [Difficulty: **Beginner/Intermediate**] 104 | 105 | **solution:** 106 | 107 | 108 | 109 | ```{r,echo=FALSE,eval=FALSE} 110 | r.icluster.tuned <- iClusterPlus::tune.iClusterPlus( 111 | cpus=4, 112 | t(x1), # Providing each omics type 113 | t(x2), 114 | t(x3), 115 | type=c("gaussian", "binomial", "multinomial"), # Providing the distributions 116 | K=2, # provide the number of factors to learn 117 | alpha=c(1,1,1), # as well as other model parameters 118 | n.lambda=35) 119 | 120 | ``` 121 | 122 | ### Clustering using latent factors 123 | 124 | 1. Why is one-hot clustering more suitable for NMF than iCluster? [Difficulty: **Intermediate**] 125 | 126 | **solution:** 127 | 128 | This is best approached by looking at the NMF and iCluster latent factor values, in figures 11.9 and 11.12. There, it is apparent that NMF latent factors tend to be disentangled, i.e. only one of them tends to be active for each data point. This is exactly the motivation for one-hot clustering. Contrast this with the latent factor values for iCluster. 129 | 130 | 2. Which clustering algorithm produces better results when combined with NMF: K-means or one-hot clustering? Why do you think that is? [Difficulty: **Intermediate/Advanced**] 131 | 132 | **solution:** 133 | 134 | For the examples in the text above, it is clear that one-hot clustering provides a more intuitive and easier to explain clustering result in combination with NMF, because this disentangled NMF attempts to assign one factor to each sample, which also simplifies explaining what drives each cluster (the corresponding NMF factor). However, in general, one might have specific downstream analysis tasks for which K-means or other clustering algorithms might be more appropriate. 135 | 136 | ### Biological interpretation of latent factors 137 | 138 | 1. 
Another covariate in the metadata of these tumors is their _CpG Island Methylator Phenotype_ (CIMP). This is a phenotype carried by a group of colorectal cancers that display hypermethylation of promoter CpG island sites, resulting in the inactivation of some tumor suppressors. This is also assayed using an external test. Do any of the multi-omics methods surveyed find a latent variable that is associated with the tumor's CIMP phenotype? [Difficulty: **Beginner/Intermediate**] 139 | 140 | 141 | **solution:** 142 | 143 | The solution is similar to what was done in the text for MSI status: 144 | 145 | ```{r,moNMFCIMP,echo=FALSE, eval=FALSE} 146 | a <- data.frame(age=covariates$age, gender=as.factor(covariates$gender), msi=covariates$msi, cimp=as.factor(covariates$cimp)) 147 | b <- nmf.h 148 | colnames(b) <- c('factor1', 'factor2') 149 | cov_factor <- cbind(a,b) 150 | ggplot2::ggplot(cov_factor, ggplot2::aes(x=cimp, y=factor1, group=cimp)) + ggplot2::geom_boxplot() + ggplot2::ggtitle("NMF factor 1 and CIMP status") 151 | ggplot2::ggplot(cov_factor, ggplot2::aes(x=cimp, y=factor2, group=cimp)) + ggplot2::geom_boxplot() + ggplot2::ggtitle("NMF factor 2 and CIMP status") 152 | ``` 153 | 154 | At this point, examine the two figures. Is it apparent which factor is associated with CIMP status? 155 | 156 | 2. Does MFA give a disentangled representation? Does `iCluster` give disentangled representations? Why do you think that is? [Difficulty: **Advanced**] 157 | 158 | **solution:** 159 | 160 | MFA and iCluster tend not to give disentangled representations. MFA is based on principal component analysis, which finds variance-maximizing representations, wherein most features get assigned some nonzero weight most of the time. iCluster, with its expectation maximization approach, also does the same. 161 | 162 | 3. Figures \@ref(fig:moNMFClinicalCovariates) and \@ref(fig:moNMFClinicalCovariates2) show that MSI/MSS tumors have different values for NMF factors 1 and 2. 
Which NMF factor is associated with microsatellite instability? [Difficulty: **Beginner**] 163 | 164 | **solution:** 165 | 166 | Figure \@ref(fig:moNMFClinicalCovariates) shows that NMF factor 1 is high for samples which are microsatellite instable, and figure \@ref(fig:moNMFClinicalCovariates2) shows that NMF factor 2 is high for samples which are microsatellite stable. Hence, NMF factor 1 may be said to be associated with microsatellite instability. 167 | 168 | 4. Microsatellite instability (MSI) is associated with hyper-mutated tumors. As seen in Figure \@ref(fig:momutationsHeatmap), one of the subtypes has tumors with significantly more mutations than the other. Which subtype is that? Which NMF factor is associated with that subtype? And which NMF factor is associated with MSI? [Difficulty: **Advanced**] 169 | 170 | **solution:** 171 | 172 | From figure \@ref(fig:momutationsHeatmap) one can see that CMS1 tumors harbor many more mutations than do CMS3 tumors. Figure 11.13 shows that CMS1 is predicted by NMF factor 1, the same factor associated with Microsatellite instability (see previous exercise). 173 | -------------------------------------------------------------------------------- /compgenomr_exercises.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: knitr 13 | LaTeX: pdfLaTeX 14 | --------------------------------------------------------------------------------