├── .gitignore ├── HW1 ├── Homework1.Rmd ├── Homework1.html └── README.md ├── HW2 ├── code │ └── Homework2.Rmd ├── data │ ├── part3_counts.rds │ ├── part5 │ │ ├── .Rhistory │ │ ├── A.geneBodyCoverage.r │ │ ├── B.geneBodyCoverage.r │ │ ├── C.geneBodyCoverage.r │ │ ├── D.geneBodyCoverage.r │ │ ├── E.geneBodyCoverage.r │ │ ├── F.geneBodyCoverage.r │ │ ├── G.geneBodyCoverage.r │ │ ├── H.geneBodyCoverage.r │ │ ├── J.geneBodyCoverage.r │ │ ├── K.geneBodyCoverage.r │ │ ├── L.geneBodyCoverage.r │ │ └── M.geneBodyCoverage.r │ └── part5_example.pdf └── papers │ ├── part3_4-manuscript.pdf │ └── part6 │ ├── 1_original.pdf │ ├── 2_Hartl_response.pdf │ └── 3_response_to_Hartl.pdf ├── HW3 ├── Homework3_release.Rmd └── q1_data │ ├── BRCA_phenotype.txt │ ├── BRCA_zscore_data.txt │ ├── diagnosis.txt │ └── unknown_samples.txt ├── HW4 ├── README.md └── Stat115_Homework4.Rmd ├── HW5 ├── code │ └── STAT115_HW5_2020.Rmd ├── data │ └── HW5_ESC.Dixon_2015.DI.chr21.txt └── papers │ ├── PMID23001124.pdf │ ├── Supplement_10.1038_nature11082.pdf │ └── nejmoa2002032.pdf ├── HW6 ├── README.md ├── Stat115_Homework6.Rmd └── data │ ├── GBM_clin.txt │ ├── GBM_expr.txt │ ├── GBM_meth.txt │ ├── TCGA-02-2483.maf.txt │ ├── TCGA-06-0124.maf.txt │ ├── TCGA-06-0128.maf.txt │ ├── TCGA-06-0129.maf.txt │ ├── TCGA-06-0210.maf.txt │ ├── TCGA-06-2570.maf.txt │ ├── TCGA-06-5410.maf.txt │ ├── TCGA-06-5412.maf.txt │ ├── TCGA-06-5417.maf.txt │ ├── TCGA-06-6389.maf.txt │ ├── TCGA-14-1456.maf.txt │ ├── TCGA-14-4157.maf.txt │ ├── TCGA-19-1790.maf.txt │ ├── TCGA-19-2629.maf.txt │ ├── TCGA-26-1442.maf.txt │ ├── TCGA-26-5133.maf.txt │ ├── TCGA-27-2521.maf.txt │ ├── TCGA-28-5209.maf.txt │ ├── TCGA-28-5218.maf.txt │ ├── TCGA-32-4208.maf.txt │ ├── TCGA-32-4209.maf.txt │ ├── TCGA-32-4213.maf.txt │ └── TCGA-41-3393.maf.txt └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | *.Rhistory 2 | -------------------------------------------------------------------------------- 
/HW1/Homework1.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "STAT115 Homework 1" 3 | author: "" 4 | date: "February 10, 2020" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE, eval = TRUE) 10 | ``` 11 | 12 | 13 | # Part 0: Odyssey Signup 14 | 15 | Please fill out the Odyssey survey on Canvas so we can create an account for you. 16 | 17 | # Part I: Introduction to R 18 | 19 | ## Problem 1: Installation 20 | 21 | **Please install the following R/Bioconductor packages** 22 | 23 | ```{r install, eval = FALSE} 24 | if (!requireNamespace("BiocManager", quietly = TRUE)) 25 | install.packages("BiocManager") 26 | BiocManager::install() 27 | BiocManager::install("sva") 28 | 29 | install.packages(c("ggplot2", "dplyr", "tidyr", "HistData", "mvtnorm", 30 | "reticulate")) 31 | ``` 32 | 33 | 34 | Please run this command (use `eval=TRUE`) to check that Bioconductor is working properly. 35 | 36 | ```{r, eval=FALSE} 37 | BiocManager::valid() 38 | ``` 39 | 40 | 41 | ```{r libraries, message = FALSE} 42 | # these packages are needed for HW2 43 | # affy and affyPLM are needed to read the microarray data and run RMA 44 | library(sva) # for batch effect correction. Contains ComBat and sva. 
45 | library(ggplot2) # for plotting 46 | library(dplyr) # for data manipulation 47 | library(reticulate) # needed to run python in Rstudio 48 | # these next two are not essential to this course 49 | library(mvtnorm) # need this to simulate data from multivariate normal 50 | library(HistData) # need this for data 51 | ``` 52 | 53 | 54 | ## Problem 2: Getting help 55 | 56 | You can use the `mean()` function to compute the mean of a vector like 57 | so: 58 | 59 | ```{r mean} 60 | x1 <- c(1:10, 50) 61 | mean(x1) 62 | ``` 63 | 64 | However, this does not work if the vector contains NAs: 65 | 66 | ```{r mean-na} 67 | x1_na <- c(1:10, 50, NA) 68 | mean(x1_na) 69 | ``` 70 | 71 | **Please use R documentation to find the mean after excluding NA's (hint: `?mean`)** 72 | 73 | ```{r problem2} 74 | # your code here 75 | ``` 76 | 77 | 78 | Grading: Grade on correctness. 79 | 80 | + 0.5pt 81 | 82 | # Part II: Data Manipulation 83 | 84 | ## Problem 3: Basic Selection 85 | 86 | In this question, we will practice data manipulation using a dataset 87 | collected by Francis Galton in 1886 on the heights of parents and their 88 | children. This is a very famous dataset, and Galton used it to come up 89 | with regression and correlation. 90 | 91 | The data is available as `GaltonFamilies` in the `HistData` package. 92 | Here, we load the data and show the first few rows. To find out more 93 | information about the dataset, use `?GaltonFamilies`. 94 | 95 | ```{r loadGalton} 96 | data(GaltonFamilies) 97 | head(GaltonFamilies) 98 | ``` 99 | 100 | a. **Please report the height of the 10th child in the dataset.** 101 | 102 | ```{r problem3a} 103 | # your code here 104 | ``` 105 | 106 | b. **What is the breakdown of male and female children in the dataset?** 107 | 108 | ```{r problem3b} 109 | # your code here 110 | ``` 111 | 112 | c. **How many observations (number of rows) are in Galton's dataset? 
Please answer this 113 | question without consulting the R help.** 114 | 115 | ```{r problem3c} 116 | # your code here 117 | ``` 118 | 119 | d. **What is the mean height for the 1st child in each family?** 120 | 121 | ```{r problem3d} 122 | # your code here 123 | ``` 124 | 125 | e. **Create a table showing the mean height for male and female children.** 126 | ```{r problem3e} 127 | # your code here 128 | ``` 129 | 130 | f. **What was the average number of children each family had?** 131 | 132 | ```{r problem3f} 133 | # your code here 134 | ``` 135 | 136 | g. **Convert the children's heights from inches to centimeters and store 137 | it in a column called `childHeight_cm` in the `GaltonFamilies` dataset. 138 | Show the first few rows of this dataset.** 139 | 140 | ```{r problem3g} 141 | # your code here 142 | ``` 143 | 144 | 145 | 146 | ## Problem 4: Spurious Correlation 147 | 148 | ```{r gen-data-spurious, cache = TRUE, eval=TRUE} 149 | # set seed for reproducibility 150 | set.seed(1234) 151 | N <- 25 152 | ngroups <- 100000 153 | sim_data <- data.frame(group = rep(1:ngroups, each = N), 154 | X = rnorm(N * ngroups), 155 | Y = rnorm(N * ngroups)) 156 | ``` 157 | 158 | In the code above, we generate `r ngroups` groups of `r N` observations 159 | each. In each group, we have X and Y, where X and Y are independent 160 | normally distributed data and have 0 correlation. 161 | 162 | a. **Find the correlation between X and Y for each group, and display 163 | the highest correlations.** 164 | 165 | Hint: since the data is quite large and your code might take a few 166 | moments to run, you can test your code on a subset of the data first 167 | (e.g. you can take the first 100 groups like so): 168 | 169 | ```{r subset} 170 | ``` 171 | 172 | In general, this is good practice whenever you have a large dataset: 173 | If you are writing new code and it takes a while to run on the whole 174 | dataset, get it to work on a subset first. 
By running on a subset, you 175 | can iterate faster. 176 | 177 | However, please do run your final code on the whole dataset. 178 | 179 | ```{r cor, cache = TRUE} 180 | # your code here 181 | ``` 182 | 183 | b. **The highest correlation is around 0.8. Can you explain why we see 184 | such a high correlation when X and Y are supposed to be independent and 185 | thus uncorrelated?** 186 | 187 | Because we cherrypicked the highest correlations among 100,000 188 | correlations, it is just by chance that we found a few with such 189 | a high correlation. We can see in the histogram below that most of 190 | the correlations are around the expected value of 0. 191 | 192 | ```{r cor-hist, eval=F} 193 | ``` 194 | 195 | 196 | # Part III: Plotting 197 | 198 | ## Problem 5 199 | 200 | **Show a plot of the data for the group that had the highest correlation 201 | you found in Problem 4.** 202 | 203 | ```{r problem5} 204 | # your code here 205 | ``` 206 | 207 | Grading: 1pt. 208 | 209 | ## Problem 6 210 | 211 | We generate some sample data below. The data is numeric, and has 3 212 | columns: X, Y, Z. 213 | 214 | ```{r gen-data-corr} 215 | N <- 100 216 | Sigma <- matrix(c(1, 0.75, 0.75, 1), nrow = 2, ncol = 2) * 1.5 217 | means <- list(c(11, 3), c(9, 5), c(7, 7), c(5, 9), c(3, 11)) 218 | dat <- lapply(means, function(mu) 219 | rmvnorm(N, mu, Sigma)) 220 | dat <- as.data.frame(Reduce(rbind, dat)) %>% 221 | mutate(Z = as.character(rep(seq_along(means), each = N))) 222 | names(dat) <- c("X", "Y", "Z") 223 | ``` 224 | 225 | a. **Compute the overall correlation between X and Y.** 226 | 227 | ```{r problem6a} 228 | # your code here 229 | ``` 230 | 231 | b. **Make a plot showing the relationship between X and Y. Comment on 232 | the correlation that you see.** 233 | 234 | ```{r problem6b} 235 | # your code here 236 | ``` 237 | 238 | Your text answer here. 239 | 240 | The correlation between X and Y is negative. 241 | 242 | c. 
**Compute the correlations between X and Y for each level of Z.** 243 | 244 | ```{r problem6c} 245 | # your code here 246 | ``` 247 | 248 | d. **Make a plot showing the relationship between X and Y, but this 249 | time, color the points using the value of Z. Comment on the result, 250 | especially any differences between this plot and the previous plot.** 251 | 252 | ```{r problem6d} 253 | # your code here 254 | ``` 255 | 256 | Your text answer here. 257 | 258 | 259 | # Part IV: Bash practices 260 | 261 | ## Problem 7: Bash practices on Odyssey 262 | 263 | Please answer the following questions using bash commands and include those in 264 | your answer. Data are available at `/n/stat115/2020/HW1/public_MC3.maf` 265 | 266 | Mutation Annotation Format ([MAF](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/)) 267 | is a tab-delimited text file with aggregated mutation information. 268 | MC3.maf `/n/stat115/2020/HW1/public_MC3.maf` is a curated list of [somatic mutations](https://www.britannica.com/science/somatic-mutation) 269 | that occurred in many patients with different types of cancers from TCGA. 270 | 271 | Since a complete MAF file contains far more information than we need, 272 | in this problem we will focus on part of it. 273 | 274 | ``` 275 | Chromosome Start_Position Hugo_Symbol Variant_Classification 276 | 10 123810032 TACC2 Missense_Mutation 277 | 10 133967449 JAKMIP3 Silent 278 | 11 124489539 PANX3 Missense_Mutation 279 | 11 47380512 SPI1 Missense_Mutation 280 | 11 89868837 NAALAD2 Missense_Mutation 281 | 11 92570936 FAT3 Silent 282 | 12 107371855 MTERFD3 Missense_Mutation 283 | 12 108012011 BTBD11 Missense_Mutation 284 | 12 117768962 NOS1 5'Flank 285 | ``` 286 | 287 | In `/n/stats115/2020/HW1/MC3/public_MC3.maf`, `Chromosome` and `Start_Position` 288 | together specify the genomic location where a mutation has occurred. 
289 | `Hugo_Symbol` is the gene overlapping that location, and 290 | `Variant_Classification` specifies how the mutation influences downstream biological 291 | processes, e.g. transcription and translation. 292 | 293 | Please include your bash commands and the full output from the bash console 294 | with a text answer to each question. 295 | 296 | 297 | a. How many lines are there in this file? How many times does the "KRAS" gene appear? 298 | 299 | ```{r q7a, engine="bash", eval = FALSE} 300 | # your bash code here 301 | ``` 302 | 303 | ``` 304 | your bash output here 305 | ``` 306 | 307 | b. How many unique `Variant_Classification` values are there in the MAF? Please 308 | count the occurrences of each type and sort them. Which one is the most frequent? 309 | 310 | ```{r q7b, engine="bash", eval = FALSE} 311 | # your bash code here 312 | ``` 313 | 314 | ``` 315 | your bash output here 316 | ``` 317 | 318 | Your text answer: 319 | 320 | c. What are the top FIVE most frequent genes? Please provide 321 | the bash command and equivalent Python command. If you are a PI 322 | looking for a gene to investigate (you need to find a gene with potentially 323 | better biological significance), which gene out of the top 5 would you 324 | choose? Why? 325 | 326 | ```{r q7c, engine="bash", eval = FALSE} 327 | # your bash code here 328 | ``` 329 | 330 | ``` 331 | your bash output here 332 | ``` 333 | 334 | Equivalent python command: 335 | 336 | ```{r q7cpy, engine="python", eval=FALSE} 337 | # your python command here 338 | ``` 339 | 340 | ``` 341 | your python output here 342 | ``` 343 | 344 | Your text answer: 345 | 346 | 347 | d. Write a bash program that determines whether a user-input year ([YYYY]) is 348 | a leap year or not (a year is a leap year if it is a multiple of four, except that a 349 | centennial year not divisible by 400 is not a leap year). 350 | The user input can be either positional or interactive. 
351 | Please include the content of your shell script here and test it on 352 | 1900/2000/2002. Does your code run as expected? 353 | 354 | ```{r q7d, engine="bash", eval = FALSE} 355 | # your bash code here 356 | ``` 357 | 358 | 359 | 360 | # Part V: High throughput sequencing read mapping 361 | 362 | We will give you a simple example to test high throughput sequencing 363 | alignment for RNA-seq data. Normally for paired-end sequencing data, 364 | each sample will have two separate FASTQ files, with line-by-line 365 | correspondence to the two reads from the same fragment. Read mapping 366 | could take a long time, so we have created just two FASTQ files of one 367 | RNA-seq sample with only 3M fragments (2 * 3M reads) for you to run STAR 368 | instead of the full data. The files are located at 369 | `/n/stat115/2020/HW1`. The mapping will generate one single output 370 | file. Make sure to use the right parameters for single-end (SE) vs 371 | paired-end (PE) modes in BWA and STAR. 372 | 373 | Please include the commands that you used to run BWA and STAR in your 374 | answers. 375 | 376 | 377 | ## Problem 8: BWA 378 | 379 | 1. Use BWA (Li & Durbin, Bioinformatics 2009) to map the reads to the 380 | Hg38 version of the reference genome, available on Odyssey at 381 | `/n/stat115/HW2_2019/bwa_hg38_index/hg38.fasta`. In 382 | `/n/stat115/HW1_2020/BWA/loop`, you are provided with three `.fastq` 383 | files with the following structure (`A_l` and `A_r` are paired sequencing reads 384 | from sample_A). Write a for loop in bash to align reads to the reference 385 | using BWA PE mode and generate output in SAM format. 386 | 387 | How many rows are in each output `.sam` file? Use SAMTools on the output 388 | to find out how many reads are mappable and uniquely mappable 389 | (please also calculate the ratio). Please include the full samtools output 390 | and a text answer. 
391 | 392 | 393 | ```{r 8, engine="bash", eval = FALSE} 394 | # please provide the content of your sbatch script (including the header) 395 | ``` 396 | 397 | ``` 398 | samtools output 399 | ``` 400 | 401 | Your text answer 402 | 403 | ## Problem 9: STAR alignment 404 | 405 | 1. Use STAR (Dobin et al, Bioinformatics 2012) to map the reads to the 406 | reference genome, available on Odyssey at 407 | `/n/stat115/HW1_2020/STARIndex`. Use the paired-end alignment mode and 408 | generate the output in SAM format. Please include the full STAR report. 409 | How many reads are mappable and how many are uniquely mappable? 410 | 411 | ```{r 9, engine="bash", eval = FALSE} 412 | # please provide the content of your sbatch script (including the header) 413 | ``` 414 | 415 | ``` 416 | Log file from STAR 417 | ``` 418 | Your text answer here. 419 | 420 | 421 | 2. If you are getting a different number of mappable fragments between 422 | BWA and STAR on the same data, why? 423 | 424 | Your text answer here. 425 | 426 | 427 | # Part VI: Dynamic programming with Python 428 | 429 | ## Problem 10 430 | 431 | Given a finite list of integers, 432 | write a Python script that maximizes Z, where Z is the sum of the 433 | numbers from location X to location Y on this list. Note that your 434 | algorithm should look at each number ONLY ONCE from left to right. 435 | Your script should return three values: the starting index location X, 436 | the ending index location Y, and Z, the sum of numbers between index 437 | X and Y (inclusive). 438 | 439 | For example, if A=[-2, 1, 7, -4, 5, 2, -3, -6, 4, 3, -8, -1, 6, -7, -9, -5], 440 | your program should return (start_index = 1, end_index = 5, sum = 11) 441 | corresponding to [1, 7, -4, 5, 2]. 442 | 443 | Please test your program with this example and see if you can get the 444 | correct numbers. 445 | 446 | Hint: Consider dynamic programming. 
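As an illustration of the single-pass requirement in Problem 10, here is a minimal sketch based on Kadane's algorithm. This is one possible approach, not necessarily the intended solution; the function name `max_subarray` is hypothetical, and the sketch assumes a non-empty list and 0-based indices.

```python
def max_subarray(numbers):
    """Return (start_index, end_index, best_sum) of the maximum-sum contiguous run.

    Single left-to-right pass (Kadane's algorithm); assumes a non-empty list.
    """
    best_sum = current_sum = numbers[0]
    best_start = best_end = current_start = 0
    for i in range(1, len(numbers)):
        value = numbers[i]
        if current_sum < 0:
            # A negative running sum can only hurt, so start a new run at i.
            current_sum = value
            current_start = i
        else:
            # Otherwise extend the current run with this value.
            current_sum += value
        if current_sum > best_sum:
            best_sum = current_sum
            best_start, best_end = current_start, i
    return best_start, best_end, best_sum


A = [-2, 1, 7, -4, 5, 2, -3, -6, 4, 3, -8, -1, 6, -7, -9, -5]
print(max_subarray(A))  # prints (1, 5, 11)
```

Each element is read once, and only the running sum and the best run seen so far are stored, so memory use stays constant.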
447 | 448 | ```{python dynamic-programming, eval=TRUE, echo = TRUE} 449 | 450 | ``` 451 | 452 | -------------------------------------------------------------------------------- /HW1/README.md: -------------------------------------------------------------------------------- 1 | # Homework-1 2 | 3 | - Due: 02/09/2020 11:59pm 4 | 5 | Please fill out the Odyssey signup form by Wednesday 01/29 so that you can get the account in time. 6 | Please feel free to post on Canvas Discussion if you find anything unclear. Thanks! 7 | 8 | -------------------------------------------------------------------------------- /HW2/code/Homework2.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "STAT115 Homework 2" 3 | author: '' 4 | date: 'Due: Sunday, February 23, 2020 at 11:59pm' 5 | output: 6 | html_document: default 7 | pdf_document: default 8 | --- 9 | 10 | ```{r setup, include=FALSE} 11 | knitr::opts_chunk$set(echo = TRUE, eval = TRUE) # please knit with `echo=TRUE, eval=TRUE` 12 | ``` 13 |

14 | 15 | # Part I: RNA-seq quality control 16 | For this question, we will examine a series of tools to perform essential quality-control analyses for high-throughput RNA sequencing data. 17 | 18 | ### Problem I.1 (3pts) 19 | **You are asked by a collaborator to analyze four RNA-seq libraries. She suspects that the libraries are generally of high quality but is concerned that a sample may have been switched with her benchmates during processing. Execute FastQC, STAR, and RSeQC (tin.py) to determine whether any of the samples exhibit unusual quality control metrics. Overall, identify the best and worst libraries. Your answer should provide evidence from all three tools. Include screenshots and tables as necessary as if you were delivering a report to the collaborator.** 20 | ``` 21 | Sequencing data: 22 | /n/stat115/2020/HW2/raw_data 23 | 24 | modules: fastqc/0.11.8-fasrc01, STAR/2.6.0c-fasrc01 25 | index: /n/stat115/2020/HW2/star_hg38_index 26 | bed: /n/stat115/2020/HW2/hg38_RefSeq.bed 27 | ``` 28 | 29 | ``` 30 | Hint: not required but helpful to run STAR with these parameters: 31 | --outSAMtype BAM SortedByCoordinate 32 | --readFilesCommand zcat 33 | ``` 34 | 35 | Student response 36 |
37 | 38 | ### Problem I.2 (0.5pt; graduate students only) 39 | 40 | **Your collaborator recalls that one of her samples was left on the bench for a couple of days before the full RNA-seq library was processed. Using the metrics from the question above, can you identify this sample? Provide your rationale.** 41 | 42 | Student response 43 | 44 | ----- 45 | 46 |

47 | 48 | # Part II: Pseudoalignment 49 | 50 | ### Problem II.1 (1pt) 51 | **Use Salmon to process the 4 sequencing libraries introduced in the previous question. Identify the transcript and gene with the highest expression in each library from the Salmon output.** 52 | 53 | ``` 54 | module: salmon/0.12.0-fasrc01 55 | index: /n/stat115/2020/HW2/salmon_hg38_index 56 | ``` 57 | 58 | Student response 59 |
60 | 61 | ### Problem II.2 (1pt) 62 | **Report the relative speed of Salmon and STAR for the analyses of these four samples. Comment on your results based on the lecture material.** 63 | 64 | ``` 65 | Hint: you can parse the times from log files or use the `time` tool in the command line... 66 | e.g. time sleep 2 67 | ``` 68 | 69 | Student response 70 |
71 | 72 | ### Problem II.3 (1pt; graduate students only) 73 | **Plot the relationship between effective length, normalized read counts, TPM, and FPKM for runX from the Salmon output. Comment on the relative utility of each metric when analyzing gene expression data.** 74 | 75 | Student response 76 | 77 | ----- 78 | 79 |
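To make the quantities named in Problem II.3 concrete, the sketch below computes TPM and FPKM from read counts and effective lengths using their standard definitions. The function name `tpm_and_fpkm` and the three toy transcripts are hypothetical; with real Salmon output, the counts and effective lengths would come from the quantification table instead.

```python
def tpm_and_fpkm(counts, effective_lengths):
    """Compute TPM and FPKM for transcripts in one sample.

    counts: reads (or fragments) assigned to each transcript.
    effective_lengths: effective transcript lengths in bases, same order.
    """
    total_reads = sum(counts)
    # Reads per base of effective length: the length-normalized rate.
    rates = [c / length for c, length in zip(counts, effective_lengths)]
    rate_sum = sum(rates)
    # TPM rescales the rates so that they sum to one million.
    tpm = [r / rate_sum * 1e6 for r in rates]
    # FPKM: fragments per kilobase of transcript per million mapped fragments.
    fpkm = [c * 1e9 / (length * total_reads)
            for c, length in zip(counts, effective_lengths)]
    return tpm, fpkm


# Toy example (made-up numbers, for illustration only).
counts = [100, 200, 700]
effective_lengths = [1000.0, 2000.0, 3500.0]
tpm, fpkm = tpm_and_fpkm(counts, effective_lengths)
print(tpm)
print(fpkm)
```

Within a sample, TPM is FPKM rescaled to sum to one million, so the two rank transcripts identically. They differ in how comparable the values are across samples.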

80 | 81 | # Part III: Differential expression 82 | 83 | In 2014, a controversial manuscript from Lin et al. argued that, based on RNA-seq of several tissues from both mouse and human, fundamental physiological differences existed between these two organisms. Here, we will investigate these claims for a subset of the data analyzed. (Note: a copy of this manuscript is included as the `part3_4-manuscript.pdf` file associated with this homework.) The provided data is a counts matrix of the samples with the following conventions: 84 | ``` 85 | row: [mouse gene]_[human gene] gene name (e.g. Stag2_STAG2) 86 | column: [species]_[tissue]_[batch] sample identifier (e.g. human_adipose_3) 87 | ``` 88 | 89 | ### Problem III.1 (1pt) 90 | **Perform a principal component analysis of the samples using the top 5,000 most variable genes as features. Indicate the species, tissue, and batch of each sample in the plot. Do the results support the conclusions of the original paper? Do the results suggest the presence of a batch effect?** 91 | 92 | ```{r import_piii, include=TRUE, echo=TRUE, eval = TRUE} 93 | 94 | # Import processed raw counts 95 | counts <- readRDS("../data/part3_counts.rds") 96 | 97 | # Perform a log2(CPM + 1) normalization (no gene-length correction, despite the variable name) 98 | log2tpm <- sapply(1:dim(counts)[2], function(idx){ 99 | log2((counts[,idx]/sum(counts[,idx]) * 1000000) + 1) 100 | }) 101 | colnames(log2tpm) <- colnames(counts) 102 | 103 | # Continue analysis here. 104 | ``` 105 | 106 | Student response 107 |
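The normalization in the `import_piii` chunk scales each sample's counts to one million and applies log2 with a pseudocount of 1. The sketch below mirrors that step in plain Python and then ranks genes by variance, as one would before a PCA on the most variable genes. The function names and the toy counts are hypothetical, and the PCA itself is not shown.

```python
import math
from statistics import pvariance


def log2_scaled_counts(counts_by_gene):
    """Scale each sample to one million total counts, then take log2(x + 1).

    counts_by_gene: dict mapping gene name -> list of counts, one per sample.
    """
    n_samples = len(next(iter(counts_by_gene.values())))
    library_sizes = [sum(row[j] for row in counts_by_gene.values())
                     for j in range(n_samples)]
    return {
        gene: [math.log2(c / library_sizes[j] * 1_000_000 + 1)
               for j, c in enumerate(row)]
        for gene, row in counts_by_gene.items()
    }


def top_variable_genes(values_by_gene, n_top):
    """Return the n_top gene names with the largest variance across samples."""
    ranked = sorted(values_by_gene,
                    key=lambda gene: pvariance(values_by_gene[gene]),
                    reverse=True)
    return ranked[:n_top]


# Toy counts for three genes in four samples (made-up numbers).
toy_counts = {
    "GeneA": [250, 250, 250, 250],
    "GeneB": [50, 450, 50, 450],
    "GeneC": [700, 300, 700, 300],
}
logged = log2_scaled_counts(toy_counts)
print(top_variable_genes(logged, 2))
```

Ranking on the log scale keeps a handful of very highly expressed genes from dominating the variance-based selection.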
108 | 109 | ### Problem III.2 (1pt) 110 | **Run ComBat on the samples to remove the batch effect. Visualize the results using a principal component analysis similar to the one in the question above. Provide evidence that the batch effects are successfully adjusted. Do these results change the primary interpretation of the results?** 111 | 112 | Student response 113 |
114 | 115 | ### Problem III.3 (1pt) 116 | **Run DESeq2 adjusting for the batch effect to identify differentially-expressed genes between the lung and adipose tissue. Report the number of statistically-significant genes as well as whether they are more highly expressed in adipose tissue or in lung tissue.** 117 | 118 | Student response 119 |
120 | 121 | ### Problem III.4 (1pt) 122 | **Identify the top 5 most differentially expressed genes that are overexpressed in each of the tissues. Comment on the biological relevance of these. It may be useful to use data from the GTEx consortium when interpreting your result.** 123 | 124 | ``` 125 | GTEx link: https://www.gtexportal.org/home 126 | ``` 127 | Student response 128 |
129 | 130 | ### Problem III.5 (1pt) 131 | **Visualize the differential gene expression values by making a volcano plot and an MA plot to summarize the differences between the two tissues. Be sure to use the `lfcShrink` function to get more robust estimates of the fold-changes for genes.** 132 | 133 | Student response 134 |
135 | 136 | ### Problem III.6 (1pt; graduate students only) 137 | **Rerun differential gene expression analyses without accounting for the batch effect. Compare the number of differentially expressed genes and anecdotes of top differentially expressed genes. Are the numbers of differentially expressed genes before/after the batch effect adjustment consistent with what you would have expected? Comment on the biological relevance of the top genes.** 138 | 139 | Student response 140 | 141 | ----- 142 | 143 |

144 | 145 | # Part IV: Gene ontology 146 | 147 | While the previous question identified genes that were differentially expressed between tissues and specific anecdotes were used for interpretation, we often want to assess the differences between samples using a more holistic approach. Pathway enrichment analyses provide a statistically principled way of examining many differentially expressed genes in an effort to identify biological patterns that explain the results. These patterns are defined using prior biological knowledge. 148 | 149 | ### Problem IV.1 (1.5pts) 150 | 151 | **Run the up and down regulated genes computed in problem III.3 separately on DAVID (http://david.abcc.ncifcrf.gov/) to see whether these genes are enriched in specific biological processes, pathways, etc. For example, consider reporting the enrichments for the top 100 genes in the KEGG pathways. If you were to summarize the results in a paper, how would you describe the systematic biological features that are different between these tissues? Your analysis should comment on the stability of enriched pathways (with at least 2 different input gene list sizes) and attempt to interpret the results in terms of the differential physiological properties of the tissues.** 152 | 153 | Student response 154 | 155 | ### Problem IV.2 (0.5pts) 156 | 157 | **Describe in at least 3 but no more than 7 sentences the methodological differences between how approaches like DAVID and approaches like GSEA work in identifying enriched pathways from RNA-seq data.** 158 | 159 | Student response 160 | 161 | ### Problem IV.3 (1pt; graduate students only) 162 | **Run Gene Set Enrichment analysis (http://www.broadinstitute.org/gsea/index.jsp) using the summary statistics from problem III.3. What are the gene sets or experiments that best capture the differential expression data between these two cell types? 
Comment on the biological relevance of the results and compare them to the results produced from the DAVID analysis.** 163 | 164 | ``` 165 | Hint: the fgsea package (https://bioconductor.org/packages/release/bioc/html/fgsea.html) 166 | is, in my hands, easier to use than the original java distribution of gsea 167 | ``` 168 | 169 | Student response 170 | 171 | ----- 172 | 173 |

174 | 175 | # Part V: Python programming 176 | 177 | ### Problem V.1 (2pts) 178 | 179 | **RSeQC on RNA-seq generates many output files. One such file is called geneBodyCoverage.r which contains normalized reads mapped to each % of gene / transcript body. Suppose that we want to visualize all 12 samples from a recent RNA-seq library together to quickly perform quality control. These data files are present in the `part5` folder. Write a python program to extract the values and name from each file. The same script should then draw the gene body coverage for all the samples (3 rows x 4 cols) in one figure. We provide an example with 3 x 2 samples in one figure. Include your code and final figure in your report.** 180 | 181 | Student response 182 | 183 | ----- 184 | 185 |
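As a starting point for Problem V.1, the sketch below extracts the sample name and coverage values from the text of a `geneBodyCoverage.r` file. It assumes the file contains an assignment of the form `NAME.ds <- c(v1,v2,...)`, as in the files under `part5`. The function names and the tiny `example_text` are hypothetical, and the plotting itself (for instance a 3 x 4 grid from matplotlib's `plt.subplots(3, 4)`) is left out.

```python
import re

# Matches an R assignment such as `A.ds <- c(0.0,0.5,1.0)` at the start of a line.
ASSIGNMENT = re.compile(r"^\s*([\w.]+)\s*<-\s*c\(([^)]*)\)")


def parse_gene_body_coverage(text):
    """Return (sample_name, values) from the contents of a geneBodyCoverage.r file."""
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            variable, body = match.groups()
            # Drop the trailing ".ds" that follows the sample name.
            name = variable[:-3] if variable.endswith(".ds") else variable
            values = [float(v) for v in body.split(",") if v.strip()]
            return name, values
    raise ValueError("no `<name> <- c(...)` assignment found")


def grid_position(index, n_cols=4):
    """Map the i-th sample (0-based) to a (row, column) cell of the figure grid."""
    return divmod(index, n_cols)


# Shortened stand-in for a real file (made-up values).
example_text = 'A.ds <- c(0.0,0.5,1.0)\n\npng("example.png")\nx=1:100\n'
print(parse_gene_body_coverage(example_text))
print(grid_position(5))
```

In a full script, the text would be read from each of the 12 files with `open(path).read()`, and each `(name, values)` pair would be drawn in the grid cell given by `grid_position`.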

186 | 187 | # Part VI: Batch effects and classification in the literature 188 | 189 | In a recent manuscript (published September 2019), Zhou et al. describe a modified version of RNA-seq called SILVER-seq that enables profiling of extracellular RNAs (exRNAs). The manuscript reports impressive performance in classifying patients with breast cancer compared to healthy controls as well as whether the cancer was recurrent. About three weeks ago, the original findings were challenged by Hartl and Gao, who argued that a batch effect confounded the interpretation of the work. The authors then rebutted the challenge. 190 | 191 | ### Problem VI.1 (1 pt) 192 | **In the main manuscript (see `1_original.pdf`), what bioinformatics methods were used in conjunction with the SILVER-seq protocol to predict patient status? Name the methods and describe their purpose (list at least 3).** 193 | 194 | Student response 195 | 196 |
197 | 198 | ----- 199 | The next questions are for graduate students only. In 3-5 sentences each, answer the following questions related to the short letters that comment on the manuscript. 200 | 201 | ### Problem VI.2 (0.5 pts; graduate students only) 202 | **Briefly summarize the Hartl and Gao response (see `2_Hartl_response.pdf`). Specifically, what evidence do Hartl and Gao offer that the interpretation of the original manuscript may be confounded by a batch effect? If you were the bioinformatician analyzing the original data, what steps, if any, could you take to eliminate the batch effect?** 203 | 204 | Student response 205 | 206 | ### Problem VI.3 (0.5 pts; graduate students only) 207 | **Summarize the response to Hartl (see `3_response_to_Hartl.pdf`). Specifically, how do the original authors argue that there is "a lack of between batch differences"? Do you find their rebuttal convincing?** 208 | 209 | Student response 210 | 211 | ### Problem VI.4 (0.5 pts; graduate students only) 212 | **Design a modified version of the study that utilizes both proper experimental setup and computational tools discussed in this lab/homework that would ameliorate the potential batch effect in assessing the efficacy of SILVER-seq. Comment specifically on batch design and analytical methods. Assume that you want to test the same number of samples (~130) but no more than 10 samples can be processed per batch.** 213 | 214 | Student response 215 | 216 | ### Problem VI.5 (0.0 pts; graduate students only) 217 | **After reading the primary manuscript and both responses, how do you interpret the efficacy of SILVER-seq as a tool for cancer diagnostics/prediction? What remaining questions do you think need to be answered before this technology could be confidently used with patients?** 218 | 219 | Optional response (feedback will be given but students will not receive points.) 
220 | 221 | 222 | **Note:** you can access the raw SILVER-seq data from this study here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131512 223 | 224 |

225 | -------------------------------------------------------------------------------- /HW2/data/part3_counts.rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW2/data/part3_counts.rds -------------------------------------------------------------------------------- /HW2/data/part5/.Rhistory: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW2/data/part5/.Rhistory -------------------------------------------------------------------------------- /HW2/data/part5/A.geneBodyCoverage.r: -------------------------------------------------------------------------------- 1 | A.ds <- c(0.0,0.0425833309242,0.0977599775842,0.156917206121,0.229372691429,0.285565609552,0.308986552749,0.338505124721,0.376918851724,0.405403361924,0.445907450359,0.465981591477,0.498704643049,0.528590140165,0.547986025711,0.551223862143,0.556732186896,0.589410762693,0.585725944721,0.610581231668,0.629381141827,0.65945121341,0.666874220284,0.678729238152,0.698058410036,0.70322427132,0.718019137926,0.748073642989,0.758456512699,0.765548175048,0.78169288143,0.795073418386,0.775183851736,0.769831192195,0.768165574416,0.777830160357,0.781759595093,0.795035613977,0.795389196389,0.798215631901,0.792320367904,0.777151904786,0.789758563255,0.799394239942,0.825959620444,0.829159652466,0.839504717768,0.852871911991,0.860779704814,0.861364561258,0.862389727875,0.86359724517,0.879332774421,0.883746995105,0.890144835362,0.891605864576,0.880753775437,0.871587318177,0.871282659118,0.880669271465,0.886789138126,0.892933466464,0.90244683477,0.903723289517,0.907545982392,0.910436907777,0.905769175175,0.937582697144,0.943233344378,0.946698007263,0.965471231957,0.983646257475,0.985783318471,0.995298910566,1.0,0.99896149065,0.988825461492,0.977381844544,0.9464822997
53,0.941454313372,0.925974519828,0.905017534574,0.92283897768,0.911557697311,0.897903634338,0.912676263056,0.928331735912,0.906567515339,0.847823911511,0.817177878639,0.785050802454,0.727345707977,0.699839664831,0.64017096488,0.532759744086,0.45725766818,0.370447626439,0.246529221696,0.131906253961,0.0216619262903) 2 | 3 | 4 | png("analysis/RSeQC/gene_body_cvg/11.1.FF/11.1.FF.geneBodyCoverage.curves.png") 5 | x=1:100 6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1) 7 | plot(x,V11.1.FF.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1]) 8 | dev.off() 9 | -------------------------------------------------------------------------------- /HW2/data/part5/B.geneBodyCoverage.r: -------------------------------------------------------------------------------- 1 | B.ds <- c(0.05437399393153749,0.1237482015413325,0.21601447314064304,0.31159988034017577,0.39913673983959885,0.48560520805139673,0.5533127252524965,0.6127010356272882,0.6719041581788915,0.6988703542785509,0.7403809171070813,0.7739568939728486,0.8012934657188849,0.8395133833815297,0.8721776663485235,0.900668100685195,0.9151697317625607,0.9342868132024673,0.9509537172894201,0.9513668285873018,0.9543868146269889,0.9660678927050242,0.9772788786165045,0.9860966680437043,0.9927206940269805,0.9850852576247525,0.9898431601589767,1.0,0.9923360731634354,0.9827347967919771,0.9833758315645522,0.9767233151469394,0.9730053134660038,0.9491018390575364,0.9437171469679055,0.9271784498354677,0.9221641333922136,0.9137452100457272,0.9101554153193065,0.9036453510733771,0.8789156540691463,0.8569210387612359,0.8448410946024872,0.8432456302796336,0.8356101938774056,0.8329036026154218,0.8254675992535506,0.8051396743543355,0.7827889286172167,0.7757233009017223,0.764213165429707,0.750252852604738,0.7401387484152195,0.7253237225601504,0.706633997635294,0.6842974971153435,0.6743543355318452,0.6705793529822363,0.6561774384250488,0.6416473169133463,0.6256072023818
003,0.6126867904101198,0.6033846635991966,0.5818743856750096,0.5746663057878317,0.5647516346386701,0.5459337027592985,0.5313893360304278,0.5185116597102523,0.5046368181882933,0.4995940113107024,0.49613242353879683,0.4799783472699041,0.473397056938133,0.467414065727432,0.4389948574766022,0.43559025057337,0.42788358808530036,0.4437954956623314,0.45071867120614256,0.3803188079602274,0.3624553056311344,0.35387968489579624,0.3365290103847633,0.31417826464764453,0.2925967606376159,0.2907876180572373,0.2802319121355005,0.273550905283551,0.26382142195757774,0.235744098918788,0.19189732047465063,0.16399094004188094,0.15240957848402398,0.09125486118035869,0.07495833273978261,0.05629709824926281,0.034615877719055825,0.016239547571902734,0.0) 2 | 3 | 4 | png("/liulab/jingxin/proj/cidc-rnaseq/results/pilot1/gene_body_coverage/5760-Frozen.geneBodyCoverage.curves.png") 5 | x=1:100 6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1) 7 | plot(x,5760_Frozen.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1]) 8 | dev.off() 9 | -------------------------------------------------------------------------------- /HW2/data/part5/C.geneBodyCoverage.r: -------------------------------------------------------------------------------- 1 | C.ds <- 
c(0.05924531390685111,0.1360890562331178,0.22388475075714168,0.3211099287877548,0.42092166653024476,0.5126790537775231,0.5868052713432103,0.6506834738479168,0.7158058443152984,0.7144470819350086,0.766129164279283,0.804960301219612,0.8332487517393795,0.8648604403699762,0.8888761561758206,0.9075550462470329,0.9080461651796676,0.9263485307358599,0.9424408610951952,0.9584677089301793,0.9607268560202996,0.9686666120978964,0.9817303757059834,0.9912089711058362,1.0,0.9885569288696079,0.9932061880985512,0.9959400834902186,0.9882949987722026,0.9829418024064828,0.989506425472702,0.9832528443971515,0.9723008921993943,0.9434722108537285,0.9432757632806744,0.9258410411721372,0.9063599901776214,0.889121715642138,0.8915118277809609,0.8886633379716788,0.8599492510436277,0.8361790947041008,0.8276827371695179,0.8317099124171237,0.8177457640992061,0.8158467708930179,0.7963166080052386,0.7774740116231481,0.7598101006793812,0.748088728820496,0.7371695178849145,0.7213391176229844,0.7064254727019726,0.6948678071539658,0.6807890644184333,0.6603748874519113,0.6519767537038553,0.6401571580584432,0.6195792747810428,0.6182859949251044,0.600949496603094,0.5913399361545387,0.57595154293198,0.5575509535892609,0.5434558402226406,0.527576328067447,0.5073585986739789,0.4940001637063109,0.4834083653924859,0.4716869935336007,0.4716869935336007,0.48010149791274453,0.47104853892117543,0.4683473847916837,0.4491282638945731,0.41966112793648197,0.4151264631251535,0.4105263157894737,0.41453712040599167,0.4252762543996071,0.36125071621511007,0.3407055741998854,0.3360890562331178,0.3200294671359581,0.29658672341818776,0.27728574936563805,0.28190226733240564,0.26772530081034623,0.26369812556274047,0.2641565032331996,0.24420070393713678,0.1830236555619219,0.15928624048457068,0.15830400261930097,0.08545469427846443,0.08291724645985103,0.06479495784562495,0.04320209544077924,0.017909470410084307,0.0) 2 | 3 | 4 | 
png("/liulab/jingxin/proj/cidc-rnaseq/results/pilot1/gene_body_coverage/5760-Norm.geneBodyCoverage.curves.png") 5 | x=1:100 6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1) 7 | plot(x,5760_Norm.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1]) 8 | dev.off() 9 | -------------------------------------------------------------------------------- /HW2/data/part5/D.geneBodyCoverage.r: -------------------------------------------------------------------------------- 1 | D.ds <- c(0.03190949160116182,0.08926939248065321,0.16172079702638475,0.24599354091767717,0.33469420916864706,0.42810716389413606,0.49326671135214184,0.5520890459651047,0.6275668758759368,0.6508033229744277,0.7005260699124571,0.7356854142546666,0.7641622488980968,0.8049479007982451,0.8472975443300226,0.8745759957751914,0.8853817561391749,0.9210692015518047,0.93242337456584,0.9368309873458859,0.9331342798529442,0.9345154672678895,0.9359372778420978,0.9515568825787581,0.9664655820283143,0.9625048240001625,0.9825117299372372,0.9966689009404262,0.9900067028212785,0.9844413300022343,1.0,0.9917941218288546,0.968192066296996,0.9401620864054597,0.9378871894867263,0.9166615887717587,0.9014279040480978,0.9019356935388865,0.9108524769971361,0.9090853695691914,0.8915970995064286,0.848759978063494,0.8489224707005464,0.8464647695651291,0.8508520707655435,0.8628155911685251,0.8405540998923486,0.8060244145187171,0.7749070745231856,0.7755164219121321,0.762577945686836,0.7768163630085512,0.7653403205167266,0.7444397050758638,0.7158613125342758,0.6896390632299474,0.682286271403327,0.6768427680620722,0.6664635508703511,0.672983567932078,0.6602888306623607,0.6597607295919403,0.6457660512258039,0.6263684926776756,0.6100176710742794,0.5998212580992424,0.5578575345804643,0.5324680600410294,0.5306603294538216,0.5306400178741901,0.5285073020128775,0.5400849024028599,0.5366116222858651,0.5460768183941666,0.5160969268580018,0.464728941969817,0.4725285
885483314,0.4519732699612049,0.5525359007169988,0.5550139134320476,0.40078808928970405,0.38486381085857047,0.39629923019113195,0.38646842564946277,0.3570978815022444,0.3338208112444905,0.3503138139053074,0.30725326508642575,0.31422013690004674,0.3145451221741515,0.3110921536367883,0.23400970893506387,0.20463916478784555,0.2172932788982999,0.10539678670810229,0.1039140413949993,0.08451648284687101,0.0613003473280117,0.02211931021875571,0.0) 2 | 3 | 4 | png("/liulab/jingxin/proj/cidc-rnaseq/results/pilot1/gene_body_coverage/5812-FFPE.geneBodyCoverage.curves.png") 5 | x=1:100 6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1) 7 | plot(x,5812_FFPE.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1]) 8 | dev.off() 9 | -------------------------------------------------------------------------------- /HW2/data/part5/E.geneBodyCoverage.r: -------------------------------------------------------------------------------- 1 | E.ds <- 
c(0.0,0.062990547978,0.14598588179,0.227814668581,0.316621799474,0.396120483369,0.445220148361,0.482920555157,0.533937544867,0.574662000479,0.616511127064,0.630407992343,0.672194304858,0.700514477148,0.720384661402,0.729172648959,0.744017707586,0.779770878201,0.783016271835,0.809655419957,0.827916367552,0.851046901173,0.851779731993,0.873878320172,0.891920914094,0.891331658291,0.897891241924,0.911800071788,0.934090093324,0.93862766212,0.949814548935,0.960498923187,0.939351519502,0.935561737258,0.930635319454,0.941346614022,0.946267049533,0.96054379038,0.959673366834,0.960238693467,0.962455132807,0.943859176837,0.956990308686,0.96425879397,0.983728164633,0.987359416128,0.986877841589,0.997041756401,0.999009930605,1.0,0.991558985403,0.985361330462,0.99290201005,0.986351399856,0.998606125867,0.994131371141,0.985131012204,0.975969131371,0.971126465662,0.973570232113,0.984176836564,0.981646326872,0.980174682939,0.980049054798,0.978490667624,0.975397822446,0.968455372099,0.990344580043,0.986740248863,0.976058865757,0.981039124192,0.98432938502,0.975008973439,0.97118329744,0.963481095956,0.957256520699,0.946413615698,0.93068616894,0.899943168222,0.894212132089,0.873034816942,0.839291696578,0.848121560182,0.825598229241,0.803795764537,0.811593682699,0.807648360852,0.772771596076,0.724338956688,0.688238813113,0.654717037569,0.604172648959,0.56914034458,0.514653625269,0.426707944484,0.359790021536,0.286159966499,0.182923546303,0.0898630055037,0.00443586982532) 2 | 3 | 4 | png("analysis/RSeQC/gene_body_cvg/2.2.FF/2.2.FF.geneBodyCoverage.curves.png") 5 | x=1:100 6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1) 7 | plot(x,V2.2.FF.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1]) 8 | dev.off() 9 | -------------------------------------------------------------------------------- /HW2/data/part5/F.geneBodyCoverage.r: -------------------------------------------------------------------------------- 
1 | F.ds <- c(0.0,0.0517678183082,0.0879088850526,0.116247786023,0.178082191781,0.208394054212,0.201376589019,0.209447794991,0.221285563751,0.197654866265,0.208438894245,0.217160280699,0.22845996906,0.239759657422,0.249893504921,0.251799206331,0.258031970944,0.262560814295,0.269152299173,0.288747393673,0.300091922068,0.306773087012,0.314440732686,0.322556778692,0.339125170953,0.347779297357,0.348317377755,0.37353989642,0.393179830953,0.404770979531,0.412685245387,0.422213752438,0.420151110912,0.42160841199,0.414097706432,0.425531914894,0.435150102011,0.436674663139,0.443781808399,0.445777189875,0.435710602426,0.428625877183,0.443624868282,0.450709593525,0.470125327893,0.486088379705,0.498531488913,0.499719749793,0.514920521041,0.516153621954,0.520973925521,0.522834786898,0.537721677914,0.546129184136,0.549380086541,0.553841669843,0.536667937134,0.528843351344,0.518373203596,0.516400242136,0.520278905006,0.542272941282,0.561262695334,0.561912875816,0.572696903796,0.587583794812,0.591440037666,0.617985337309,0.633253368607,0.652467322826,0.686994148376,0.717642311055,0.724323475999,0.745913951976,0.76660762729,0.772661031769,0.790753985158,0.803286774432,0.781808398538,0.803667914714,0.811111360222,0.81185122077,0.846153846154,0.855794453288,0.865547160505,0.89471560209,0.942380557362,0.97829742394,0.958029728942,0.967356455844,1.0,0.980001345201,0.971481738896,0.955563527117,0.788556823532,0.719592852499,0.610160751519,0.453915655898,0.26785193821,0.0785821581508) 2 | 3 | 4 | png("analysis/RSeQC/gene_body_cvg/3.1.FF/3.1.FF.geneBodyCoverage.curves.png") 5 | x=1:100 6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1) 7 | plot(x,V3.1.FF.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1]) 8 | dev.off() 9 | -------------------------------------------------------------------------------- /HW2/data/part5/G.geneBodyCoverage.r: 
-------------------------------------------------------------------------------- 1 | G.ds <- c(0.0,0.0670088790233,0.0910496273981,0.122482955446,0.194644839068,0.230141113049,0.222728714127,0.224770096718,0.228139368955,0.204792294276,0.216822578088,0.223997146028,0.242329950848,0.257590772158,0.264864436341,0.267599492627,0.276458696686,0.280719835104,0.290966386555,0.306821785318,0.312747740606,0.319724116062,0.326244648803,0.336312827018,0.350107023942,0.357856350087,0.365883145711,0.379934992865,0.400309180276,0.413746630728,0.414063738703,0.425856191533,0.429304740764,0.432614555256,0.43114793087,0.441335024576,0.456318376407,0.458439035992,0.467654986523,0.470172031077,0.463195655621,0.463017282385,0.473759315047,0.478773584906,0.496650547011,0.504895354368,0.512803234501,0.519938163945,0.531294593309,0.534624227049,0.53743856033,0.541521325511,0.56038925004,0.575491517362,0.569644839068,0.569803393055,0.558684794673,0.552719200888,0.546813064849,0.557138893293,0.55012287934,0.58076343745,0.592258601554,0.590613603932,0.611146345331,0.623751387347,0.631996194704,0.652865863326,0.665788013319,0.684338829872,0.7070516886,0.733629300777,0.738326462661,0.76115823688,0.785258443,0.790550182337,0.803630886317,0.821210559696,0.810944188996,0.821626763913,0.840990169653,0.833042651023,0.857519422863,0.878012525765,0.891549072459,0.918305057872,0.960936261297,0.98602742984,0.97449262724,0.979784366577,1.0,0.976553829079,0.953068019661,0.923240050737,0.792214999207,0.717674805771,0.609501347709,0.455545425717,0.268788647534,0.0652251466624) 2 | 3 | 4 | png("analysis/RSeQC/gene_body_cvg/3.2.FF/3.2.FF.geneBodyCoverage.curves.png") 5 | x=1:100 6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1) 7 | plot(x,V3.2.FF.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1]) 8 | dev.off() 9 | -------------------------------------------------------------------------------- 
/HW2/data/part5/H.geneBodyCoverage.r: -------------------------------------------------------------------------------- 1 | H.ds <- c(0.0,0.053502292304,0.126145115693,0.199631490371,0.293835650509,0.37147967601,0.414057253712,0.458906409332,0.506311090109,0.536749405162,0.576843998972,0.596823936131,0.635376509895,0.658415615855,0.681643329105,0.700634632444,0.717169895789,0.750306745923,0.76030707754,0.791910198058,0.813187587568,0.841816516195,0.848413626151,0.875115029721,0.897250478772,0.899086809096,0.906610374645,0.935168834614,0.94834232845,0.953005695526,0.964713855796,0.97434733587,0.950667794165,0.939164822046,0.938252874707,0.946315318228,0.949478117409,0.969951335174,0.973412589848,0.973072682203,0.969901592592,0.949818025054,0.962616376916,0.966960562423,0.987728090465,0.987274189403,0.989607945548,1.0,0.996571907048,0.998824831497,0.993396672221,0.98651146981,0.996714916971,0.98645136419,0.99645791363,0.989726084181,0.981570373318,0.971186609297,0.960751030086,0.961990449424,0.968782384494,0.964898317872,0.961611162235,0.961706502185,0.962794621169,0.957528125285,0.95181601877,0.976782649787,0.974921448172,0.964112799595,0.977159864369,0.986397476393,0.983023271238,0.982148630835,0.978695666592,0.971853988941,0.958230739258,0.938000016581,0.904784407359,0.898011125758,0.878816706875,0.848345230101,0.860959119888,0.840199882276,0.82004584608,0.824408685055,0.833393438953,0.804476417871,0.744582203762,0.710160336923,0.674984040922,0.622748111854,0.585563459099,0.53281352335,0.414720488141,0.346836371776,0.271553046319,0.171222258147,0.0817166994139,0.0025949047015) 2 | 3 | 4 | png("analysis/RSeQC/gene_body_cvg/2.1.FF/2.1.FF.geneBodyCoverage.curves.png") 5 | x=1:100 6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1) 7 | plot(x,V2.1.FF.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1]) 8 | dev.off() 9 | 
-------------------------------------------------------------------------------- /HW2/data/part5/J.geneBodyCoverage.r: -------------------------------------------------------------------------------- 1 | J.ds <- c(0.06359964400629835,0.14023413431916204,0.23799548161840214,0.33796125145478195,0.432340658588348,0.5244882590538783,0.603751625932772,0.665023618812898,0.72667898952557,0.7354008352159923,0.7798042034640925,0.8149654275347437,0.8384336277127404,0.8696241528034504,0.8874922982131854,0.903566783049223,0.9172314643663997,0.925734237009653,0.9452454302731567,0.9572807558020128,0.9648661600602451,0.9731772437872253,0.978585609639214,0.9874169918532211,0.9978229615937564,0.9894981858013281,0.9976449647429315,1.0,0.9931128910796193,0.9929485862942425,0.9901827890737318,0.9807352639145616,0.9606490039022386,0.9312521393852262,0.9306633805709591,0.9091805298829329,0.895680153351133,0.8734716231943589,0.8799068939549531,0.8732114739508455,0.8393099199014171,0.8119942493325119,0.810542890395016,0.8097213664681318,0.7991784760731157,0.80639419456425,0.7896898747176011,0.767481344560827,0.745491887451222,0.7326487300609297,0.719737112343397,0.7110563428493188,0.6969672075032519,0.6858492503594167,0.6689669336619429,0.6555760936537277,0.6492092832203737,0.6332443349079209,0.6181693708495927,0.6059971246662559,0.5955226945984802,0.5812555624015883,0.5666050523721503,0.5449715889641953,0.5306907647018553,0.5155884165126309,0.4963510645580886,0.48072841788183746,0.4714451975080441,0.45861573218319984,0.4599438625316629,0.47095228315191345,0.4647360854384884,0.46799479701512975,0.4470048606832341,0.419921955226946,0.4247141781337715,0.41989457109604983,0.41942904087081534,0.4271787499144246,0.36655028411035806,0.34831245293352503,0.3469021701923735,0.34059012802081196,0.31420551790237555,0.2895871842267406,0.3102622030533306,0.2872321489696721,0.2789758335044841,0.29233928938180326,0.27789416033408637,0.19863079345519272,0.17597042513863217,0.18688300130074623,0.0864517012
3913193,0.0844526596837133,0.06699527623742041,0.044923666735126995,0.013609913055384405,0.0) 2 | 3 | 4 | png("/liulab/jingxin/proj/cidc-rnaseq/results/pilot1/gene_body_coverage/5869-Norm.geneBodyCoverage.curves.png") 5 | x=1:100 6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1) 7 | plot(x,5869_Norm.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1]) 8 | dev.off() 9 | -------------------------------------------------------------------------------- /HW2/data/part5/K.geneBodyCoverage.r: -------------------------------------------------------------------------------- 1 | K.ds <- c(0.24300350877192983,0.3102035087719298,0.34040701754385966,0.3138526315789474,0.3466105263157895,0.46616140350877194,0.5223859649122807,0.5820350877192982,0.6295017543859649,0.476519298245614,0.47943859649122805,0.4977122807017544,0.5151438596491228,0.5557052631578947,0.5570526315789474,0.566540350877193,0.5841122807017544,0.6104701754385965,0.6334596491228071,0.6302315789473685,0.6406456140350877,0.6670315789473684,0.6811508771929825,0.7008561403508772,0.7120842105263158,0.6575157894736842,0.7679157894736842,0.8416,0.8528842105263158,0.8762947368421052,0.9393122807017544,0.9460771929824562,0.8846877192982456,0.7545263157894737,0.769740350877193,0.7541614035087719,0.7481263157894736,0.8174035087719298,0.840701754385965,0.8485894736842106,0.6943719298245614,0.5897824561403509,0.5955649122807017,0.6075228070175439,0.6514526315789474,0.8120982456140351,0.8025543859649122,0.7217122807017544,0.6504140350877193,0.6294175438596491,0.6284350877192982,0.6834245614035088,0.7216280701754386,0.7188491228070175,0.6484210526315789,0.5955649122807017,0.5813894736842106,0.545740350877193,0.5023438596491228,0.5965473684210526,0.5333333333333333,0.5995228070175439,0.5714807017543859,0.5334736842105263,0.440140350877193,0.41324912280701753,0.38669473684210526,0.4225122807017544,0.4583859649122807,0.5468912280701754,0.62731228070
17544,0.8967859649122807,0.9177824561403509,0.9382456140350878,0.802161403508772,0.683059649122807,0.9293473684210526,0.7798456140350877,0.7568561403508772,0.7861052631578948,0.6334596491228071,0.5705824561403509,0.7172771929824562,0.7948350877192982,0.7307228070175439,0.6842666666666667,1.0,0.7305824561403509,0.6729263157894737,0.9635649122807017,0.8319438596491228,0.5459368421052632,0.4895438596491228,0.3022315789473684,0.16067368421052633,0.19508771929824562,0.17883508771929824,0.12654035087719298,0.01552280701754386,0.0) 2 | 3 | 4 | png("/liulab/jingxin/proj/cidc-rnaseq/results/pilot1/gene_body_coverage/5903-FFPE.geneBodyCoverage.curves.png") 5 | x=1:100 6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1) 7 | plot(x,5903_FFPE.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1]) 8 | dev.off() 9 | -------------------------------------------------------------------------------- /HW2/data/part5/L.geneBodyCoverage.r: -------------------------------------------------------------------------------- 1 | L.ds <- 
c(0.00958043229036,0.0898729133462,0.178319771831,0.260733859016,0.356027877974,0.425271929089,0.463688029528,0.507532226044,0.54897519783,0.577192209826,0.615262282247,0.638026731538,0.672395352739,0.701321896933,0.724463831334,0.734865643263,0.751258283701,0.783274613427,0.791764533177,0.809572742779,0.826321896933,0.84207141459,0.850404048877,0.861151888824,0.876978301597,0.881724827336,0.89009590918,0.916177334116,0.924188407013,0.929501160417,0.947907054777,0.955519671169,0.943034700668,0.939046640383,0.943995889607,0.948476777675,0.955858708721,0.969175544557,0.976959427341,0.984425243967,0.977742359422,0.960905824455,0.966711405643,0.972149987417,0.989360512261,0.990223834689,0.991569499203,0.995050750776,0.996183206107,0.99681584319,0.992464278724,0.990611805497,0.9963544725,0.9971933283,1.0,0.996148253782,0.986784525907,0.97556133434,0.968361155384,0.9663304253,0.967784442021,0.964352123703,0.963268601627,0.960203282722,0.959228112854,0.958130609848,0.951818219948,0.976742722926,0.974484802729,0.972129016022,0.977438274194,0.986277717194,0.981076811229,0.982751027598,0.982233733188,0.976679808741,0.966966557615,0.950643821827,0.922479238319,0.907792271342,0.890826412773,0.862595419847,0.869683751363,0.854395604396,0.834116265414,0.837048765484,0.839163381148,0.815661437799,0.7626247798,0.727260017336,0.691741464642,0.637100494925,0.608516483516,0.548730531555,0.452667561446,0.379204764701,0.299625311076,0.190025305483,0.0853361015575,0.0) 2 | 3 | 4 | png("analysis/RSeQC/gene_body_cvg/8.1.FF/8.1.FF.geneBodyCoverage.curves.png") 5 | x=1:100 6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1) 7 | plot(x,V8.1.FF.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1]) 8 | dev.off() 9 | -------------------------------------------------------------------------------- /HW2/data/part5/M.geneBodyCoverage.r: -------------------------------------------------------------------------------- 1 
| M.ds <- c(0.0,0.0726672876088,0.156201062425,0.238977424318,0.323459671174,0.397884020293,0.441138439388,0.487947922744,0.536934115311,0.573897230666,0.609485402742,0.626875194044,0.667819844901,0.691446351624,0.715697212243,0.730285940437,0.737962422766,0.769732824297,0.782329078856,0.805419937701,0.820059842444,0.842423994296,0.84264917111,0.865808265354,0.879820404431,0.88796088747,0.891809363944,0.918025404039,0.929434362664,0.936431902778,0.947581566889,0.967212891031,0.965165829077,0.955302402227,0.949000863178,0.948492509459,0.939256848275,0.950781807078,0.962207824553,0.960314292245,0.957649699935,0.939560495798,0.949614981764,0.962576295705,0.982231502236,0.9807269117,0.978133966558,0.990211632088,0.988556923675,0.991893634661,0.992964930417,0.989003865535,0.99911293982,0.99776187893,1.0,0.998004114595,0.989635042971,0.975407962389,0.971010190957,0.971460544587,0.97446972566,0.975397727079,0.980938441435,0.977461847883,0.981405853915,0.979481615678,0.973664547958,0.98028338161,0.98070644108,0.97536360938,0.98429221127,0.988396570489,0.985721742869,0.984711858971,0.982173502148,0.97221113397,0.958543583655,0.94845156822,0.917035990761,0.915401752967,0.900175023797,0.848357062193,0.881785583907,0.858977901966,0.837766928349,0.831748566204,0.823962907237,0.801598755386,0.758262453813,0.721469926954,0.690108937814,0.639109800991,0.594739733131,0.538053175846,0.437999611058,0.365137852564,0.286994674227,0.193150530701,0.0997908585037,0.0150663759839) 2 | 3 | 4 | png("analysis/RSeQC/gene_body_cvg/9.1.FF/9.1.FF.geneBodyCoverage.curves.png") 5 | x=1:100 6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1) 7 | plot(x,V9.1.FF.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1]) 8 | dev.off() 9 | -------------------------------------------------------------------------------- /HW2/data/part5_example.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW2/data/part5_example.pdf -------------------------------------------------------------------------------- /HW2/papers/part3_4-manuscript.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW2/papers/part3_4-manuscript.pdf -------------------------------------------------------------------------------- /HW2/papers/part6/1_original.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW2/papers/part6/1_original.pdf -------------------------------------------------------------------------------- /HW2/papers/part6/2_Hartl_response.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW2/papers/part6/2_Hartl_response.pdf -------------------------------------------------------------------------------- /HW2/papers/part6/3_response_to_Hartl.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW2/papers/part6/3_response_to_Hartl.pdf -------------------------------------------------------------------------------- /HW3/Homework3_release.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Homework 3" 3 | author: "" 4 | date: "February 23, 2020" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE, cache = TRUE, message = FALSE) 10 | ``` 11 | 12 | Spring 2020 STAT115/215 BIO/BST282 13 | Due: 3/8/2020 midnight 14 | 15 | # HOMEWORK 3: Classification and scRNA-seq 
16 | 17 | ## Part I: Sample classification 18 | 19 | We provide you with z-score-normalized expression data of 50 breast tumor samples, 50 normal breast samples (your training and cross-validation data), and 20 samples without diagnosis (your testing data). We want to use the 100 samples with known diagnosis to train machine learning models in order to predict the 20 unknown samples. 20 | 21 | You will need the following libraries in R: `ggplot2` and `ggfortify` for plotting, `MASS` and `caret` for machine learning, and `pROC` for evaluating testing performance. The [YouTube video on caret](https://youtu.be/z8PRU46I3NY) and the [package documentation](http://topepo.github.io/caret/index.html) might be helpful. 22 | 23 | ```{r prepare} 24 | library(ggplot2) 25 | library(ggfortify) 26 | library(pROC) 27 | library(caret) 28 | library(e1071) # KNN 29 | library(kernlab) #SVM 30 | 31 | #### read in data for question 1 32 | dataset <- read.table(file = "q1_data/BRCA_zscore_data.txt", sep = "\t", header = TRUE, row.names = 1) 33 | phenotype <- read.table(file = "q1_data/BRCA_phenotype.txt",sep = "\t", header = TRUE, row.names = 1) 34 | phenotype <- as.character(phenotype[rownames(dataset),'phenotype']) # the labels 35 | ``` 36 | 37 | 38 | ### 1. Run PCA for dimension reduction on the 100 samples with known labels, and draw these 100 samples in a 2D plot. Do the cancer and normal samples separate along the first two PCs? Would this be sufficient to classify the unknown samples? 39 | 40 | 41 | ```{r} 42 | # your code here 43 | ``` 44 | 45 | 46 | ### 2. Draw a plot showing the cumulative % variance captured by the top 100 PCs. How many PCs are needed to capture 90% of the variance? 47 | 48 | 49 | ```{r} 50 | # your code here 51 | ``` 52 | 53 | 54 | ### 3. Apply machine learning methods (KNN, logistic regression, Ridge regression, LASSO, ElasticNet, random forest, and support vector machines) on the top 25 PCs of the training data, with 5-fold cross-validation, to classify the samples.
`caret` and `MASS` already implement all of these machine learning methods, including cross-validation. In order to get consistent results from different runs, use `set.seed(115)` right before each `train` command. 55 | 56 | ```{r} 57 | # your code here 58 | ``` 59 | 60 | ### 4. Summarize the performance of each machine learning method, in terms of accuracy and kappa. 61 | 62 | ```{r} 63 | # your code here 64 | ``` 65 | 66 | 67 | ### 5. For Graduate students: Compare the performance difference between logistic regression, Ridge, LASSO, and ElasticNet. In LASSO, how many PCs have non-zero coefficients? In ElasticNet, what is the lambda for Ridge and LASSO, respectively? 68 | 69 | ```{r} 70 | # your code here 71 | ``` 72 | 73 | 74 | ### 6. Use the PCA projections in Q1 to obtain the first 25 PCs of the 20 unknown samples. Use one method that performs well in Q4 to make predictions for the unknown samples (`q1_data/unknown_samples.txt`). Caret already used the hyper-parameters learned from cross-validation to train the parameters of each method on the full set of 100 training samples. You just need to call this method to make the predictions. 75 | 76 | ```{r} 77 | # your code here 78 | ``` 79 | 80 | 81 | ### 7. For Graduate students: Can you find the top 3 genes that are most important in the prediction method used in Q6? Do they have some known cancer relevance? 82 | 83 | ```{r} 84 | # your code here 85 | ``` 86 | 87 | 88 | ### 8. Suppose a pathologist later made diagnoses on the 20 unknown samples (load the `q1_data/diagnosis.txt` file). Based on this gold standard, draw an ROC curve of your predictions in Q6. What is the prediction AUC? 89 | 90 | ```{r} 91 | # your code here 92 | ``` 93 | 94 | 95 | ## Part II. Single cell RNA-seq 96 | 97 | For this exercise, we will be analyzing a single cell RNA-Seq dataset of human peripheral blood mononuclear cells (PBMC) from 10X Genomics (droplet-based) from a healthy donor (Next GEM).
The raw data, already processed by CellRanger into expression-matrix format, can be found at the link below. 98 | 99 | https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_v3_nextgem 100 | 101 | Please provide code and a text answer for each question. 102 | 103 | ### 1. Load data: Read the 10X data and create a Seurat (Butler et al., Nature Biotechnology 2018) Object. Please report the number of cells, the number of genes, and the dropout rate. 104 | 105 | ```{r} 106 | # your code here 107 | ``` 108 | 109 | 110 | ### 2. QC genes: We want to filter out genes that are detected in very few cells. Let’s keep all genes expressed in >= 10 cells. How do the above summary statistics change after filtering? 111 | 112 | ```{r} 113 | # your code here 114 | ``` 115 | 116 | 117 | ### 3. QC cells: Next we will filter out cells with a high proportion of mitochondrial reads (potential dead cells) or an outlying number of genes (potential poor reactions or multiplets). What proportion of the counts from your filtered dataset map to mitochondrial genes? Remove those cells with a high mitochondrial rate (> 5%). Outlier cells with extremely high or low gene coverage should be removed, and the cutoff depends on the scRNA-seq technology and the distribution of each dataset. What is the distribution of the number of genes and UMIs in your dataset? Let’s remove cells whose number of covered genes is more than 1 standard deviation from the average. Keep the remaining cells for downstream analysis. 118 | 119 | 120 | ```{r} 121 | # your code here 122 | ``` 123 | 124 | 125 | ### 4. Dimension reduction: Use the global-scaling normalization method in Seurat with a scale factor of 10000, so that all cells are normalized to the same sequencing depth of 10K. Use the Seurat function `FindVariableFeatures` (named `FindVariableGenes` in Seurat v2) to select 2000 genes (the default) showing expression variability, then perform PCA on these genes. Provide summary plots, statistics, and tables to show: 126 | - How many PCs are statistically significant?
127 | - The top 5 genes with the most positive and negative coefficients in each of the significant PCs, 128 | - How much variability is explained in each of the significant PCs. 129 | 130 | ```{r} 131 | # your code here 132 | ``` 133 | 134 | 135 | ### 5. For GRADUATE students: Sometimes scRNA-seq data might have significant PCs that are heavily weighted by cell cycle genes, which need to be removed before downstream analyses. Check the top PCs in this data to see whether cell cycle components need to be removed. Provide plots and other quantitative arguments to support your case. 136 | 137 | 138 | ```{r} 139 | # your code here 140 | ``` 141 | 142 | 143 | ### 6. Visualization: Use Seurat to run UMAP on the top 20 PCs (regardless of how many PCs are statistically significant) from Q4. Visualize the cells and their UMAP coordinates and comment on the number of cell clusters that appear in this data. Describe the difference between PCA and UMAP in 2D plots. 144 | 145 | ```{r} 146 | # your code here 147 | ``` 148 | 149 | 150 | ### 7. For GRADUATE students: Use Seurat to run tSNE on the top 20 PCs (regardless of how many PCs are statistically significant) from Q4. Comment on the difference between tSNE and UMAP in runtime and results. 151 | 152 | ```{r} 153 | # your code here 154 | ``` 155 | 156 | 157 | ### 8. For GRADUATE students: Try different `resolution` values in clustering and draw the resulting clusters in different colors on UMAP. How does resolution influence the number of clusters and the number of cells assigned to each cluster? 158 | 159 | 160 | ```{r} 161 | # your code here 162 | ``` 163 | 164 | 165 | ### 9. Clustering: Use resolution = 0.6 to cluster the cells. How many clusters do you get and how many cells are assigned to each cluster? Use Seurat to calculate differential expression between clusters (one vs. the rest) and identify putative biomarkers for each cell subpopulation. Visualize the gene expression values of these potential markers on your UMAP plots.
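The two summaries asked for in the clustering question (cells per cluster, and the strongest marker genes per cluster) can be sketched in base R once clustering and one-vs-rest testing are done. This is only an illustrative sketch under stated assumptions: the toy `clusters` vector and `markers` table are hypothetical stand-ins for what Seurat's `FindClusters` and `FindAllMarkers` would produce, `geneA` to `geneC` are placeholder names, the helper `top_markers` is not part of any package, and the fold-change column name (`avg_logFC` here) varies between Seurat versions, so it is passed as a parameter.

```r
# Hypothetical toy inputs standing in for Seurat output (not real results).
clusters <- factor(c(0, 0, 1, 1, 1, 2))          # one cluster label per cell
markers <- data.frame(
  cluster   = factor(c(0, 0, 1, 1, 2)),          # cluster each marker belongs to
  gene      = c("MS4A1", "CD79A", "geneA", "geneB", "geneC"),
  avg_logFC = c(1.2, 2.0, 0.8, 1.5, 0.9),        # column name is an assumption
  stringsAsFactors = FALSE
)

# Number of cells assigned to each cluster.
cells_per_cluster <- table(clusters)

# Hypothetical helper: keep the n markers with the largest fold change per cluster.
top_markers <- function(markers, n = 1, fc_col = "avg_logFC") {
  by_cluster <- split(markers, markers$cluster)
  top <- lapply(by_cluster, function(d) {
    head(d[order(-d[[fc_col]]), , drop = FALSE], n)
  })
  do.call(rbind, top)
}

top1 <- top_markers(markers, n = 1)
print(cells_per_cluster)
print(top1)
```

With a real marker table, the genes returned by such a helper would be natural candidates to pass to a plotting function for overlaying expression on the UMAP.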
166 | 167 | ```{r} 168 | # your code here 169 | ``` 170 | 171 | 172 | ### 10. Annotation: For GRADUATE students: Based on the expression characteristics of your cell clusters, provide putative biological annotation (e.g. MS4A1, CD79A genes are high in B-cells) for the clusters. This paper (Newman et al, Nat Methods 2015, https://www.nature.com/articles/nmeth.3337) may serve as a good resource as well as this tutorial PBMC (https://satijalab.org/seurat/pbmc3k_tutorial.html). 173 | 174 | 175 | ```{r} 176 | # your code here 177 | ``` 178 | 179 | ## Rules for submitting the homework: 180 | 181 | Please submit your solution directly on the canvas website. Please provide both your code in this Rmd document and an html file for your final write-up. Please pay attention to the clarity and cleanness of your homework. 182 | 183 | The teaching fellows will grade your homework and give the grades with feedback through canvas within one week after the due date. Some of the questions might not have a unique or optimal solution. TFs will grade those according to your creativity and effort on exploration, especially in the graduate-level questions. 
184 | -------------------------------------------------------------------------------- /HW3/q1_data/BRCA_phenotype.txt: -------------------------------------------------------------------------------- 1 | sample phenotype 2 | TCGA-A7-A13G-01 Tumor 3 | TCGA-A7-A13G-11 Normal 4 | TCGA-AC-A23H-01 Tumor 5 | TCGA-AC-A23H-11 Normal 6 | TCGA-AC-A2FB-01 Tumor 7 | TCGA-AC-A2FB-11 Normal 8 | TCGA-AC-A2FF-01 Tumor 9 | TCGA-AC-A2FF-11 Normal 10 | TCGA-AC-A2FM-01 Tumor 11 | TCGA-AC-A2FM-11 Normal 12 | TCGA-AC-A6IX-01 Tumor 13 | TCGA-AC-A6IX-06 Tumor 14 | TCGA-A7-A0CE-01 Tumor 15 | TCGA-A7-A0CE-11 Normal 16 | TCGA-A7-A0CH-01 Tumor 17 | TCGA-A7-A0CH-11 Normal 18 | TCGA-A7-A0DB-01 Tumor 19 | TCGA-A7-A0DB-11 Normal 20 | TCGA-BH-A0AY-01 Tumor 21 | TCGA-BH-A0AY-11 Normal 22 | TCGA-BH-A0BV-01 Tumor 23 | TCGA-BH-A0BV-11 Normal 24 | TCGA-BH-A0DZ-01 Tumor 25 | TCGA-BH-A0DZ-11 Normal 26 | TCGA-A7-A0D9-01 Tumor 27 | TCGA-A7-A0D9-11 Normal 28 | TCGA-BH-A0B3-01 Tumor 29 | TCGA-BH-A0B3-11 Normal 30 | TCGA-BH-A0B8-01 Tumor 31 | TCGA-BH-A0B8-11 Normal 32 | TCGA-BH-A0BA-01 Tumor 33 | TCGA-BH-A0BA-11 Normal 34 | TCGA-BH-A0BJ-01 Tumor 35 | TCGA-BH-A0BJ-11 Normal 36 | TCGA-BH-A0BM-01 Tumor 37 | TCGA-BH-A0BM-11 Normal 38 | TCGA-BH-A0C0-01 Tumor 39 | TCGA-BH-A0C0-11 Normal 40 | TCGA-BH-A0DK-01 Tumor 41 | TCGA-BH-A0DK-11 Normal 42 | TCGA-BH-A0DP-01 Tumor 43 | TCGA-BH-A0DP-11 Normal 44 | TCGA-BH-A0E0-01 Tumor 45 | TCGA-BH-A0E0-11 Normal 46 | TCGA-BH-A0E1-01 Tumor 47 | TCGA-BH-A0E1-11 Normal 48 | TCGA-BH-A0H7-01 Tumor 49 | TCGA-BH-A0H7-11 Normal 50 | TCGA-BH-A0H9-01 Tumor 51 | TCGA-BH-A0H9-11 Normal 52 | TCGA-BH-A0HK-01 Tumor 53 | TCGA-BH-A0HK-11 Normal 54 | TCGA-BH-A0BC-01 Tumor 55 | TCGA-BH-A0BC-11 Normal 56 | TCGA-BH-A0DH-01 Tumor 57 | TCGA-BH-A0DH-11 Normal 58 | TCGA-BH-A0DQ-01 Tumor 59 | TCGA-BH-A0DQ-11 Normal 60 | TCGA-BH-A0B7-01 Tumor 61 | TCGA-BH-A0B7-11 Normal 62 | TCGA-BH-A0BQ-01 Tumor 63 | TCGA-BH-A0BQ-11 Normal 64 | TCGA-BH-A0BW-01 Tumor 65 | TCGA-BH-A0BW-11 Normal 66 | TCGA-BH-A0DL-01 Tumor 
67 | TCGA-BH-A0DL-11 Normal 68 | TCGA-BH-A0H5-01 Tumor 69 | TCGA-BH-A0H5-11 Normal 70 | TCGA-BH-A0DO-01 Tumor 71 | TCGA-BH-A0DO-11 Normal 72 | TCGA-BH-A0DT-01 Tumor 73 | TCGA-BH-A0DT-11 Normal 74 | TCGA-BH-A18J-01 Tumor 75 | TCGA-BH-A18J-11 Normal 76 | TCGA-A7-A13E-01 Tumor 77 | TCGA-A7-A13E-11 Normal 78 | TCGA-A7-A13F-01 Tumor 79 | TCGA-A7-A13F-11 Normal 80 | TCGA-BH-A0AU-01 Tumor 81 | TCGA-BH-A0AU-11 Normal 82 | TCGA-BH-A0AZ-01 Tumor 83 | TCGA-BH-A0AZ-11 Normal 84 | TCGA-BH-A0B5-01 Tumor 85 | TCGA-BH-A0B5-11 Normal 86 | TCGA-BH-A0BS-01 Tumor 87 | TCGA-BH-A0BS-11 Normal 88 | TCGA-BH-A0BT-01 Tumor 89 | TCGA-BH-A0BT-11 Normal 90 | TCGA-BH-A0BZ-01 Tumor 91 | TCGA-BH-A0BZ-11 Normal 92 | TCGA-BH-A0C3-01 Tumor 93 | TCGA-BH-A0C3-11 Normal 94 | TCGA-BH-A0DD-01 Tumor 95 | TCGA-BH-A0DD-11 Normal 96 | TCGA-BH-A0DG-01 Tumor 97 | TCGA-BH-A0DG-11 Normal 98 | TCGA-BH-A0DV-01 Tumor 99 | TCGA-BH-A0DV-11 Normal 100 | TCGA-BH-A0HA-01 Tumor 101 | TCGA-BH-A0HA-11 Normal 102 | -------------------------------------------------------------------------------- /HW3/q1_data/diagnosis.txt: -------------------------------------------------------------------------------- 1 | sample phenotype 2 | Test1 Normal 3 | Test2 Tumor 4 | Test3 Tumor 5 | Test4 Normal 6 | Test5 Tumor 7 | Test6 Normal 8 | Test7 Tumor 9 | Test8 Normal 10 | Test9 Tumor 11 | Test10 Normal 12 | Test11 Tumor 13 | Test12 Normal 14 | Test13 Tumor 15 | Test14 Normal 16 | Test15 Tumor 17 | Test16 Normal 18 | Test17 Tumor 19 | Test18 Normal 20 | Test19 Tumor 21 | Test20 Tumor 22 | -------------------------------------------------------------------------------- /HW4/README.md: -------------------------------------------------------------------------------- 1 | # Homework-4 2 | 3 | - Due: March 29, 2020 at 11:59pm 4 | -------------------------------------------------------------------------------- /HW4/Stat115_Homework4.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Stat 115 
2020: Homework 4' 3 | author: '(Your name)' 4 | date: "Due: March 29, 2020 at 11:59pm" 5 | output: html_document 6 | --- 7 | 8 | Androgen receptor (AR) is a transcription factor frequently over-activated in prostate cancer. To study AR regulation in prostate cancer, scientists conducted AR ChIP-seq in prostate tumors and normal prostate tissues. Since the difference between individual patients could be quite big, this study actually included many more tumor and normal samples. However, for the purpose of this HW, we will only use the ChIP-seq data from 1 prostate tumor sample (tumor) and 1 normal prostate tissue (normal). 9 | 10 | Hint: It helps to read the MACS README and Nature Protocol paper: 11 | 12 | https://pypi.python.org/pypi/MACS2/2.0.10.09132012 13 | 14 | https://search-proquest-com.ezp-prod1.hul.harvard.edu/docview/1036979599/fulltextPDF/7A4604F292854FFAPQ/1?accountid=11311 15 | 16 | # Part I. Call AR ChIP-seq peaks 17 | 18 | ## 1. For GRADUATE students: 19 | 20 | Usually we use BWA to map the reads to the genome for ChIP-seq experiments. We will give you one example ChIP-seq single-end sequenced .fastq file with only 1M reads. Run BWA to map this file to the Hg38 human genome assembly. Report the commands, log files, and a snapshot / screenshot of the output to demonstrate your alignment procedure. What proportion of the reads are successfully mapped (to find at least one location) and what proportion are uniquely mapped (to find a single location) in the human genome in this test sample? We will save you some time and directly give you the BWA mapped BAM files for all the 4 samples. 21 | 22 | Hint: 23 | 1). Target sample fastq file is stored as /n/stat115/2020/HW4/tumor_1M.fastq on the Odyssey 24 | 2). The index file is stored as /n/stat115/2020/HW1/bwa_hg38_index/hg38.fasta on the Odyssey 25 | 26 | ```{r, engine='bash', eval=FALSE} 27 | # your bash code here 28 | ``` 29 | 30 | ## 2. 
For GRADUATE students: 31 | 32 | In ChIP-Seq experiments, when sequencing library preparation involves a PCR amplification step, it is common to observe multiple reads where identical nucleotide sequences are disproportionally represented in the final results. This is especially a problem in tissue ChIP-seq experiments (as compared to cell lines) when input cell numbers are low. Removing these duplicated reads can improve the peak calling accuracy. Thus, it may be necessary to perform a duplicate read removal step, which flags identical reads and subsequently removes them from the dataset. Run this on your test sample (1M reads) (macs2 filterdup). What % of reads are redundant? When doing peak calling, MACS filters duplicated reads by default. 33 | 34 | Hint: 35 | The test samples are stored as /n/stat115/2020/HW4/tumor.bam and /n/stat115/2020/HW4/normal.bam on the Odyssey. 36 | 37 | ```{r, engine='bash', eval=FALSE} 38 | # your bash code here 39 | ``` 40 | 41 | ## 3. For both: 42 | 43 | For many ChIP-seq experiments, usually chromatin input without enriching for the factor of interest is generated as control. However, in this experiment, we only have ChIP (of both tumor and normal) and no control samples. Without control, MACS2 will use the signals around the peaks to infer the chromatin background and estimate the ChIP enrichment over background. In ChIP-seq, + strand reads and – strand reads are distributed to the left and right of the binding site, and the distance between the + strand reads and – strand reads can be used to estimate the fragment length from sonication (note: with PE seq, insert size could be directly estimated). What is the estimated fragment size in each? Use MACS2 to call peaks from tumor1 and normal1 separately. How many peaks do you get from each condition with FDR < 0.05 and fold change > 5? 44 | 45 | ```{r, engine='bash', eval=FALSE} 46 | # your bash code here 47 | ``` 48 | 49 | ## 4. 
For both: 50 | 51 | Now we want to see whether AR has differential binding sites between prostate tumors and normal prostates. MACS2 does have a function to call differential peaks between conditions, but requires both conditions to have input control. Since we don’t have input controls for these AR ChIP-seq, we will just run the AR tumor ChIP-seq over the AR normal ChIP-seq (pretend the latter to be input control) to find differential peaks. How many peaks do you get with FDR < 0.01 and fold change > 6? 52 | 53 | ```{r, engine='bash', eval=FALSE} 54 | # your bash code here 55 | ``` 56 | 57 | 58 | 59 | # Part II. Evaluate AR ChIP-seq data quality 60 | 61 | ## 5. For both: 62 | 63 | Cistrome Data Browser (http://cistrome.org/db/) has collected and pre-processed most of the published ChIP-seq data in the public. Play with Cistrome DB. Biological sources indicate whether the ChIP-seq is generated from a cell line (e.g. VCaP, LNCaP, PC3, C4-2) or a tissue (Prostate). Are there over 10 AR ChIP-seq data available in human prostate tissues? 64 | 65 | ## 6. For both: 66 | 67 | Doing transcription factor ChIP-seq in tissues could be a tricky experiment, so sometimes even published studies have very bad data. Look at a few AR ChIP-seq samples in the prostate tissue on Cistrome and inspect their QC reports. Can you comment on what QC measures tell you whether a ChIP-seq is of good or bad quality? Include a screen shot of a good AR ChIP-seq vs a bad AR ChIP-seq. 68 | 69 | ## 7. For GRADUATE students: 70 | 71 | For Graduate Students: Antibody is one important factor influencing the quality of a ChIP-seq experiment. Click on the GEO (GSM) ID of some good quality vs bad quality ChIP-seq data, and see where they got their AR antibodies. If you plan to do an AR ChIP-seq experiment, which company and catalog # would you use to order the AR antibody? 72 | 73 | 74 | 75 | # Part III Find AR ChIP-seq motifs 76 | 77 | ## 8. 
For GRADUATE students: 78 | 79 | We want to see in prostate tumors, which other transcription factors (TF) might be collaborating with AR. Try any of the following motif finding tools to find TF motifs enriched in the differential AR peaks you identified above. Did you find the known AR motif, and motifs of other factors that might interact with AR in prostate cancer in gene regulation? Describe the tool you used, what you did, and what you found. Note that finding the correct AR motif is usually an important criterion for AR ChIP-seq QC. 80 | 81 | Cistrome: http://cistrome.org/ap/root (Register a free account). 82 | Weeder: http://159.149.160.88/pscan_chip_dev/ 83 | HOMER: http://homer.ucsd.edu/homer/motif/ 84 | MEME: http://meme-suite.org/tools/meme-chip 85 | 86 | ## 9. For both: 87 | 88 | Look at the AR binding distribution in Cistrome DB from a few good AR ChIP-seq data in prostate. Does AR bind mostly in the gene promoters, exons, introns, or intergenic regions? Also, look at the QC motifs to see what motifs are enriched in the ChIP-seq peaks. Do you see similar motifs here as those you found in your motif analyses? 89 | 90 | 91 | 92 | # Part IV. Identify AR-interacting transcription factors 93 | 94 | ## 10. For GRADUATE students: 95 | 96 | Sometimes members of the same transcription factor family (e.g. GATA1, 2, 3, 4, 5, 6) have similar binding motifs, similar binding sites (when they are expressed, although they might be expressed in very different tissues), and related functions. Therefore, to confirm that we have found the correct TFs interacting with AR in prostate tumors, in addition to looking for motifs enriched in the AR ChIP-seq, we also want to see whether the TFs are highly expressed in prostate tumor. For this, we will use the Exploration Component on TIMER (http://timer.cistrome.org/). First, try the “Gene DE” module to look at differential expression of genes in tumors. 
Check the top motifs you found before, and see which member of the TF family that recognizes the motif is highly expressed in prostate tissues or tumors. Another way is to see whether the TF family member and AR have correlated expression patterns in prostate tumors. Go to the “Gene Corr” tab, select prostate cancer (PRAD), enter AR as your gene of interest along with the genes (you can enter multiple genes here) that are potential AR collaborators based on the motif, correct the correlation by tumor purity, and see whether the candidate TF is correlated with AR in prostate tumors. Based on the motif and expression evidence, which factor in each motif family is the most likely collaborator of AR in prostate cancer? 97 | 98 | Note: When we conduct RNA-seq on prostate tumors, each tumor might contain cancer cells, normal prostate epithelial cells, stromal fibroblasts, and immune cells. Therefore, genes that are highly expressed in cancer cells (including AR) could be correlated in different tumors simply due to the tumor purity bias. Thus, when looking for genes correlated with AR just in the prostate cancer cells, we should correct for this tumor purity bias. 99 | 100 | ## 11. For both: 101 | 102 | Besides looking for motif enrichment, another way to find TFs that might interact with AR is to see whether there are other TF ChIP-seq data which have significant overlap with AR ChIP-seq. Take the differential AR ChIP-seq peaks (in .bed format) that are significantly higher in tumor than normal, and run this on the Cistrome Toolkit (http://dbtoolkit.cistrome.org/). The third function in Cistrome Toolkit looks through tens of thousands of published ChIP-seq data to see whether any have significant overlap with your peak list. You should see AR enriched in the results (since your input is a list of AR ChIP-seq peaks after all). What other factors did you see enriched? Do they agree with your earlier motif analyses? 103 | 104 | 105 | 106 | # PART V. 
Find AR direct target genes and pathways 107 | 108 | ## 12. For GRADUATE students: 109 | 110 | Now we try to see what target genes these AR binding sites regulate. Among the differentially expressed genes in prostate cancer, only a subset might be directly regulated by AR binding. One simple way of getting the AR target genes is to look at which genes have AR binding in their promoters. Write a Python program that takes two input files: 1) the AR differential ChIP-seq peaks in tumor over normal; 2) refGene annotation. The program writes to an output file the genes that have an AR ChIP-seq peak (in this case, a stronger peak in tumor) within 3KB (+ / -) of the transcription start site (TSS) of the gene. How many putative AR target genes in prostate cancer do you get using this approach? 111 | 112 | Note: From UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/), download the human RefSeq annotation table (find the file refGene.txt.gz for Hg38). To understand the columns in this file, check the query annotation at http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.sql. 113 | 114 | Hint: TSS is different for genes on the positive or negative strand, i.e. TSS is “txStart” for genes on the positive strand, “txEnd” for genes on the negative strand. When testing your Python code, try a smaller number of gene annotations or a smaller number of peaks to check your results before moving forward. 115 | 116 | ## 13. For GRADUATE students: 117 | 118 | Now overlap the putative AR target genes you get from above with up-regulated genes in prostate cancer (up_regulated_genes_in_prostate_cancer.txt). Try to run DAVID on 1) the AR target genes from binding alone and 2) the AR target genes by overlapping AR binding with differential expression. Are there enriched GO terms or pathways? 119 | 120 | ## 14. 
For both: 121 | 122 | Another way of getting the AR target genes is to consider the number of AR binding sites within 100KB of TSS, but weight each binding site by an exponential decay of its distance to the gene TSS (i.e. peaks closer to TSS have higher weights). For this, we have calculated a regulatory potential score for each RefSeq gene (AR_peaks_regulatory_potential.txt). Select the top 1500 genes with the highest regulatory potential score, try to run DAVID both with and without differential expression, and see the enriched GO terms. 123 | 124 | Note: Basically this regulatory potential approach assumes that there are stronger AR targets (e.g. those genes that have many AR binding sites within 100KB and stronger differential expression) and weaker AR targets, instead of binary Yes / No AR targets. 125 | 126 | ## 15. For GRADUATE students: 127 | 128 | Comment on the AR targets you get from promoter binding (your code) and distance weighted binding. Which one gives you better function / pathway enrichment? Does considering differential expression help? 129 | 130 | 131 | 132 | # PART VI. ATAC-seq 133 | 134 | The molecular mechanism of a type of T cell leukemia is poorly understood. Since it is unclear which transcription factors (TF) are involved, scientists can’t do TF ChIP-seq. Instead, ATAC-seq was performed on the T cells from both normal donors and T cell leukemia patients across many individuals. For this HW, we will only select 3 normal (norm1, norm2, norm3) and 3 leukemia (leuk1, leuk2, leuk3) samples, and give you the read mapping BAM files (to Hg38). This part of the HW will show you how epigenetic profiling can help identify key transcription factors and the regulatory mechanisms of biological processes and diseases. 135 | 136 | Unlike ChIP-seq which often uses chromatin input as controls, ATAC-seq has no control samples. 
The best way to call differential ATAC-seq peaks between the tumor and normal is to obtain the union of tumor and normal ATAC-seq peaks, extract the read counts from all the 6 samples in the union peaks, then run DESeq2 on them to find differential peaks. SAMTools (http://samtools.sourceforge.net/) and BEDTools (https://bedtools.readthedocs.io/en/latest/) are extremely useful tools to manipulate SAM/BAM and BED files. Let’s try them here. 137 | 138 | 139 | ## 16. For both: 140 | 141 | One way of getting the union peak is to run MACS on each of the samples separately, then use BEDTools to merge the peaks together. E.g. if we use MACS to run peak calling on norm1 (norm1.bed) and leuk1 (leuk1.bed), can you merge the two sets of peaks into one merge.bed file using BEDTools? How many peaks do you get? (Hint: run MACS2 with an FDR cutoff of 0.01 on each sample first.) 142 | 143 | Hint: 144 | All the bam files are stored under /n/stat115/2020/HW4/Part_VI. 145 | Please refer to https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85853 to verify whether the bam files contain data collected from a normal donor or a leukemic donor. 146 | 147 | ```{r, engine='bash', eval=FALSE} 148 | # your bash code here 149 | ``` 150 | 151 | 152 | ## 17. For GRADUATE students: 153 | 154 | Another way of calling the union peaks is to concatenate all the 6 BAM files together, then run MACS. We have done this already (union.bed). Use BEDTools to calculate the Jaccard index between the union.bed and the merge.bed you got in Q16. The Jaccard index between set A and set B is defined as $|A \cap B| / |A \cup B|$. 155 | 156 | ```{r, engine='bash', eval=FALSE} 157 | # your bash code here 158 | ``` 159 | 160 | ## 18. For both: 161 | 162 | Extract the reads from the six BAM files in the union.bed peaks. Either the BEDTools multicov function or the SAMTools bedcov function can achieve this and generate a read count matrix on the peaks in the six files. Draw a PCA plot of the resulting matrix. 
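
As a rough illustration of the PCA step in Q18, here is a minimal sketch in base R. It runs on simulated Poisson counts rather than the real multicov / bedcov output, and the normalization (counts per million followed by a log2 transform) is an assumption made for the example, not a method prescribed by the assignment. The object names (`counts`, `log_cpm`, `pca`) are hypothetical; only the sample names come from the text above.

```r
# Simulated stand-in for the real read-count matrix: rows are peaks,
# columns are the six samples (sample names follow the text above).
set.seed(115)
samples <- c("norm1", "norm2", "norm3", "leuk1", "leuk2", "leuk3")
counts <- matrix(rpois(200 * 6, lambda = 50), nrow = 200,
                 dimnames = list(paste0("peak", 1:200), samples))
# Make the first 20 peaks more accessible in the leukemia columns
counts[1:20, 4:6] <- counts[1:20, 4:6] * 4L

# Assumed normalization: scale each sample to counts per million, then log2
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6
log_cpm <- log2(cpm + 1)

# prcomp() treats rows as observations, so transpose to samples x peaks
pca <- prcomp(t(log_cpm), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Color points by condition; factor levels sort as leukemia, normal
group <- factor(rep(c("normal", "leukemia"), each = 3))
plot(pca$x[, 1], pca$x[, 2],
     col = c("firebrick", "steelblue")[group], pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.1f%%)", 100 * var_explained[2]))
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x), pos = 3)
```

With real data, the simulated `counts` matrix would be replaced by the peak-by-sample table read from the count file, and other transformations (for example a variance-stabilizing transform from DESeq2) may be preferable.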
163 | 164 | ```{r, engine='bash', eval=FALSE} 165 | # your bash code here 166 | ``` 167 | 168 | ## 19. For both: 169 | 170 | Run DESeq2 on the six samples to identify differential ATAC-seq peaks between the 3 leukemia and 3 normal samples. How many peaks are leukemia specific or normal specific at FDR < 0.05? 171 | 172 | ```{r} 173 | # your code here 174 | ``` 175 | 176 | ## 20. For both: 177 | 178 | Take the leukemia-specific ATAC-seq peaks, and run them on Cistrome Toolkit to see what public ChIP-seq have significant overlap with them. What transcription factors might be important in regulating this type of leukemic T cells? 179 | 180 | ## 21. For Graduate Students: 181 | 182 | In Q10, we mentioned that sometimes members of the same transcription factor family have similar binding motifs, similar binding sites (when they are expressed, although they might be expressed in very different tissues), and related functions. Supposedly we don’t have RNA-seq of these samples to calculate the expression level of the TF. However, we can use regulatory potential to assign the ATAC-seq peaks to genes to infer the expression level of a gene (i.e. a gene with many ATAC-seq peaks near its TSS is often expressed at higher level), and see whether the inferred TF might have higher expression in leukemia than normal. Could you describe (not necessarily do it) how to refine the hypothesis on the specific TFs that might regulate this type of leukemic T cells? 183 | 184 | # Rules for submitting the homework: 185 | 186 | Please submit your solution directly on the canvas website. Please 187 | provide both your code in this Rmd document and an html file for your 188 | final write-up. Please pay attention to the clarity and cleanness of 189 | your homework. 190 | 191 | The teaching fellows will grade your homework and give the grades with 192 | feedback through canvas within one week after the due date. Some of the 193 | questions might not have a unique or optimal solution. 
TFs will grade 194 | those according to your creativity and effort on exploration, especially 195 | in the graduate-level questions. 196 | 197 | -------------------------------------------------------------------------------- /HW5/code/STAT115_HW5_2020.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "STAT 115 Homework 5" 3 | author: "(your name)" 4 | date: "Due: Sunday 4/12/2020 by 11:59 pm" 5 | output: html_document 6 | --- 7 | 8 | # Part I. Hidden Markov Model and TAD boundaries 9 | 10 | Topologically associating domains (TADs) define genomic intervals, where sequences within a TAD physically interact more frequently with each other than with sequences outside the TAD. TADs are often defined by HiC (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3149993/), an experimental technique designed to study the three-dimensional architecture of genomes. HiC generates PE sequenced data, where the two mate pairs indicate two genomic regions that might be far apart in the genome, but physically interact with each other. If we look across the genome in bins (40kb in the early paper, but now can go down to 5-10kb with deeper sequencing), we could find reads that are mapped there and check whether their interacting mate pairs are mapped upstream or downstream. In each bin, we can calculate a directional index (DI) to quantify the degree of upstream or downstream bias of a given bin (for more details, see the supplement, `Supplement_10.1038_nature11082.pdf`). For this HW, we ask you to implement a hidden Markov Model (Viterbi) to find regions with upstream bias (DI < 0) and those with downstream bias (DI > 0), even though the DI in individual bins might have some noise. This way, TAD boundaries can be discovered where clusters of bins switch from negative DIs to positive DIs (see Supplementary Figure 12b). 
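
To make the decoding idea concrete, here is a minimal sketch of a log-space Viterbi decoder in R. It runs on a toy two-state model with three observation symbols; the function name `viterbi` and all probabilities in the example are hypothetical and chosen only for illustration, so they are not the parameters of this assignment.

```r
# Toy Viterbi decoder in log space (illustrative parameters only).
# obs:   integer-coded observations (column indices of `emit`)
# init:  initial state probabilities
# trans: trans[i, j] = P(next state = j | current state = i)
# emit:  emit[s, k]  = P(observation = k | state = s)
viterbi <- function(obs, init, trans, emit) {
  n_obs <- length(obs)
  n_states <- length(init)
  # score[s, t]: best log-probability of any path ending in state s at step t
  score <- matrix(-Inf, nrow = n_states, ncol = n_obs)
  # back[s, t]: previous state on that best path
  back <- matrix(0L, nrow = n_states, ncol = n_obs)
  score[, 1] <- log(init) + log(emit[, obs[1]])
  if (n_obs > 1) {
    for (t in 2:n_obs) {
      for (s in seq_len(n_states)) {
        cand <- score[, t - 1] + log(trans[, s])
        back[s, t] <- which.max(cand)
        score[s, t] <- max(cand) + log(emit[s, obs[t]])
      }
    }
  }
  # Trace back from the best final state
  path <- integer(n_obs)
  path[n_obs] <- which.max(score[, n_obs])
  if (n_obs > 1) {
    for (t in (n_obs - 1):1) {
      path[t] <- back[path[t + 1], t + 1]
    }
  }
  path
}

# Hypothetical two-state model with three observation symbols
states <- c("upstream", "downstream")
init  <- c(0.5, 0.5)
trans <- matrix(c(0.9, 0.1,
                  0.1, 0.9), nrow = 2, byrow = TRUE)
emit  <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.2, 0.7), nrow = 2, byrow = TRUE)

clean <- c(1, 1, 1, 3, 3, 3)   # a sustained switch in the observations
noisy <- c(1, 1, 1, 3, 1, 1)   # a single outlier inside a run
decoded_clean <- states[viterbi(clean, init, trans, emit)]
decoded_noisy <- states[viterbi(noisy, init, trans, emit)]
```

Working in log space avoids numerical underflow on long sequences such as a whole chromosome of bins. In this toy model the sticky transition matrix means a single outlying observation does not flip the decoded state, while a sustained change does, which is the behavior the assignment relies on to smooth noisy DI values.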
11 | 12 | For simplicity, we will only have two hidden states (upstream, downstream), and use the following HMM parameters (these do not necessarily capture the real data distribution; they are only meant to help your implementation): 13 | 14 | ``` 15 | Initial probability: upstream = 0.5, downstream = 0.5 16 | Transition probability: Pb(up to up) = Pb(dn to dn) = 0.9, Pb(up to dn) = Pb(dn to up) = 0.1 17 | 18 | Emission probabilities: 19 | P{<-1200, [-1200,-800), [-800,-500), [-500,0), [0,500), [500,800), [800, 1200), >= 1200 | upstream} = (0.01, 0.01, 0.02, 0.04, 0.65, 0.15, 0.08, 0.04) 20 | P{<-1200, [-1200,-800), [-800,-500), [-500,0), [0,500), [500,800), [800, 1200), >= 1200 | downstream} = (0.04, 0.08, 0.15, 0.65, 0.04, 0.02, 0.01, 0.01) 21 | 22 | ``` 23 | 24 | #### I.1 Given the DI file (`HW5_ESC.Dixon_2015.DI.chr21.txt`), implement and utilize the Viterbi algorithm to predict the hidden states of the Hi-C data. Visualize your result with a graph utilizing the following: midpoint of genomic bin on the x axis; DI score per bin on the y-axis; color: hidden state of the HMM. 25 | 26 | Hint: Example HMM code can be found at: 27 | http://www.adeveloperdiary.com/data-science/machine-learning/implement-viterbi-algorithm-in-hidden-markov-model-using-python-and-r/ 28 | 29 | 30 | ```{r} 31 | 32 | data <- read.table("../data/HW5_ESC.Dixon_2015.DI.chr21.txt", col.names = c("chr", "start", "end", "DI")) 33 | data$mid <- (data$end + data$start)/ 2 34 | 35 | # Hint: discretize the continuous directionality index into observation bins 36 | obs_states <- cut(data$DI, breaks = c(min(data$DI)-1,-1200, -800, -500, 0, 500, 800, 1200, max(data$DI)+1), right = FALSE) 37 | 38 | ``` 39 | 40 | # Part II. Single cell ATAC-seq 41 | 42 | For this exercise, we will be analyzing a single cell ATAC-Seq dataset of human peripheral blood mononuclear cells (PBMC) from the 10X Genomics platform. There are around 5,000 single cells that were sequenced on the Illumina NovaSeq. 
The raw data can be found at: https://support.10xgenomics.com/single-cell-atac/datasets/1.2.0/atac_pbmc_5k_v1. A processed Seurat scRNA-seq object used in the lab will be reused for the assignment and is available here: https://github.com/stat115/Lab_2020/blob/master/Lab09/scrna_source/output/PBMC5k_scRNAseq-for-integration.rds. 43 | 44 | 45 | #### II.1 Read the 10X data and create a Seurat object that stores the reads-in-peaks count matrix. Filter out cells with fewer than 5000 counts (from the `passed_filters` variable). How many cells are retained and how many are excluded? 46 | 47 | 48 | #### II.2 Quantify the gene activity for each cell using the `FeatureMatrix` function from Signac. Include your code below. 49 | 50 | 51 | #### II.3 Process the gene activity matrix by scaling and normalizing using Signac (`NormalizeData()`). 52 | 53 | 54 | #### II.4 Process the peak matrix. a) Perform latent semantic indexing (LSI) to reduce the dimensionality of the scATAC-seq data. Reduce the dimension to 50. b) Run UMAP on the first 20 dimensions, excluding the first component. c) Cluster all the cells using `resolution = 0.6` and visualize these clusters on a UMAP embedding. Comment on why we recommended excluding the first LSI component. 55 | 56 | 57 | #### II.5 Read in the pre-processed and clustered scRNA-seq dataset, which is provided as part of the homework and was generated for the lab exercise. Then identify anchors between the scATAC-seq dataset and the scRNA-seq dataset and use these anchors to transfer cell type labels from scRNA-seq to scATAC-seq cells. Visualize the predicted cell types on the UMAP plot of scATAC-seq data. 58 | 59 | 60 | #### II.6 [Graduate Students] Create a matrix heatmap of cluster IDs from the Seurat clusters from scATAC data with the predicted celltypes from scRNA-seq. Describe which clusters appear to map 1 to 1 between the modalities and which clusters appear split. 
(Hint: use the `pbmc@meta.data` data frame and `dplyr::group_by`) 61 | 62 | 63 | #### II.7 [Graduate Students] Using the transferred cell state annotations, find the differential peaks between the two clusters of B-cells (activated and memory). Visualize two of the top accessibility peaks that are different. Are the accessibility peaks visualized restricted to a particular celltype or present in other PBMC celltypes? 64 | 65 | 66 | #### II.8 [Graduate Students] Perform a motif analysis to identify motifs that are over-represented in the differential peaks between the activated and memory B-cells. Visualize the top two motifs that are differential between the B-cells. 67 | 68 | 69 | # Part III: GWAS Followup 70 | 71 | The NHGRI-EBI GWAS Catalog is a curated dataset of trait-associated genetic variants in humans. While it provides associations between single-nucleotide polymorphisms (SNPs) and traits (e.g. cancer), the genetic variants in the GWAS Catalog are not necessarily causative or functional for a trait, since SNPs can be highly correlated, as measured by linkage disequilibrium (LD). To learn the potential functional effect of a certain SNP, especially a non-coding variant, we can use RegulomeDB to explore the potential function of the SNP. 72 | 73 | You will explore the following online resources: The NHGRI-EBI GWAS catalog (https://www.ebi.ac.uk/gwas/), dbSNP (https://www.ncbi.nlm.nih.gov/snp/), LDLink (https://ldlink.nci.nih.gov/), and RegulomeDB (the beta version http://regulomedb.org or the more stable older version http://legacy.regulomedb.org/). 74 | 75 | #### III.1 Explore whether there are genetic variants within the gene BRCA2 which are associated with any traits. What traits are associated with the BRCA2 variants? Which SNP has the smallest p-value related to breast cancer? What is the risk allele? 
76 | 77 | 78 | #### III.2 For the BRCA2 SNP with the most significant association with breast cancer, what consequence does the risk allele have on the BRCA2 protein sequence? Based on 1000 Genomes in LDLink, what is the allele frequency of the risk allele among the 5 ethnicities? In the population with the highest risk in the resource, what is the expected number of people with the heterozygous genotype at this SNP, assuming Hardy-Weinberg equilibrium? 79 | 80 | 81 | #### III.3 Explore a certain SNP, rs4784227, that was reported to be associated with breast cancer. Is it an intergenic, exonic or intronic variant? What gene does it fall in? 82 | 83 | #### III.4 Explore the SNP rs4784227 in RegulomeDB. What functional category does the rank score (or Regulome DB Score) implicate? What factors does RegulomeDB take into consideration while scoring the potential function of SNPs? 84 | 85 | 86 | #### III.5 Describe the evidence that implicates the regulatory potential of rs4784227. For example, list several transcription factors with binding peaks overlapping this SNP, and report the cell types with open chromatin regions overlapping this SNP. 87 | 88 | 89 | #### III.6 [Graduate Students] Read the paper by Cowper-Sal et al. (PMID 23001124) and summarize the potential mechanisms of the above SNP’s function in terms of affecting transcription factor-DNA interaction and regulating genes. 90 | 91 | 92 | # Part IV: COVID19 Genomics 93 | 94 | We are currently fighting an epidemic due to the SARS-CoV-2 virus. As more viruses from infected individuals are sequenced, the epidemiology of this pathogen is becoming better understood. Nextstrain (https://nextstrain.org/ncov) is an online resource that aggregates and tracks public sequencing data of the virus. Using screenshots to support your answers, address the following questions related to SARS-CoV-2: 95 | 96 | #### IV.1 Determine the main clades of the virus as well as the main nucleotide and protein changes that define the clades. 
What are the genes associated with each mutation? 97 | 98 | 99 | #### IV.2 Identify the main clade affecting four of the countries most severely affected by SARS-CoV-2: China, United States, Iran, and Italy. 100 | 101 | 102 | #### IV.3 The countries of Georgia, Democratic Republic of Congo, and Brazil have relatively few (but non-zero!) cases of SARS-CoV-2. Using the Nextstrain data, speculate on the countries from which the virus was most likely transmitted. 103 | 104 | 105 | #### IV.4 The spike protein (S) is currently the target of several therapeutic approaches and vaccines. Understanding which cases have mutated residues of this protein is of considerable importance. For the variant in this protein with the highest minor allele frequency, visualize the proportion of these cases of the virus world-wide. 106 | 107 | 108 | #### IV.5 [Graduate Students] Preliminary reports from the New England Journal of Medicine (`nejmoa2002032.pdf`) suggest that men may be more susceptible than women. Using the metadata from Nextstrain, evaluate whether you can corroborate this finding. Further, determine whether the clade of the virus differentially affects men or women. Support your answers with statistical analyses. 109 | 110 | 111 | #### IV.6 [Graduate Students] For each country in the data reported from Nextstrain, determine the clade that is responsible for the most cases and the percentage of that country's cases it accounts for. 
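
One possible way to approach the per-country tabulation in IV.6 is sketched below in base R. This is only an illustration under stated assumptions: the `meta` data frame is toy data, and the column names `country` and `clade` are hypothetical stand-ins for whatever the downloaded Nextstrain metadata actually uses.

```r
# Toy stand-in for the Nextstrain metadata (assumption: the real table has
# one row per sequenced case; its column names may differ from these).
meta <- data.frame(
  country = c("Italy", "Italy", "Italy", "Iran", "Iran",
              "USA", "USA", "USA", "USA"),
  clade   = c("A2a", "A2a", "B", "A3", "A3",
              "B1", "B1", "A2a", "B1"),
  stringsAsFactors = FALSE
)

# Count cases for every country/clade combination.
counts <- as.data.frame(
  table(country = meta$country, clade = meta$clade),
  responseName = "n_cases",
  stringsAsFactors = FALSE
)

# Percentage of each country's cases attributed to each clade.
counts$pct_cases <- 100 * counts$n_cases /
  ave(counts$n_cases, counts$country, FUN = sum)

# Sort so the most frequent clade comes first within each country,
# then keep the first row per country.
counts <- counts[order(counts$country, -counts$n_cases), ]
top_clade <- counts[!duplicated(counts$country), ]

print(top_clade)
```

Ties are broken arbitrarily here (the first clade in sort order wins); if ties matter for your answer, it may be better to keep every clade whose count equals the country maximum.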
112 | 113 | 114 | -------------------------------------------------------------------------------- /HW5/papers/PMID23001124.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW5/papers/PMID23001124.pdf -------------------------------------------------------------------------------- /HW5/papers/Supplement_10.1038_nature11082.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW5/papers/Supplement_10.1038_nature11082.pdf -------------------------------------------------------------------------------- /HW5/papers/nejmoa2002032.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW5/papers/nejmoa2002032.pdf -------------------------------------------------------------------------------- /HW6/README.md: -------------------------------------------------------------------------------- 1 | ## Homework6 2 | Due: April 29 @11:59pm 3 | -------------------------------------------------------------------------------- /HW6/Stat115_Homework6.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Stat 115 2020: Homework 6' 3 | author: '(Your Name Here)' 4 | date: "Due: Wed, April 29, 2020 at 11:59pm" 5 | output: html_document 6 | --- 7 | 8 | ```{r} 9 | #Load packages that you might use 10 | 11 | #Run `devtools::install_github("mariodeng/FirebrowseR")` if not installed 12 | library(FirebrowseR) 13 | library(limma) 14 | library(ggplot2) 15 | library(scales) 16 | library(survival) 17 | library(magrittr) 18 | library(data.table) 19 | library(knitr) 20 | library(glmnet, quietly = TRUE) 21 | library(MAGeCKFlute) 22 | library(dplyr) 23 | 
library(tidyr) 24 | library(biobroom) 25 | library(survminer) 26 | ``` 27 | 28 | 29 | # Part I. Data exploration on TCGA 30 | 31 | The Cancer Genome Atlas (TCGA) is an NCI project to comprehensively profile > 10K tumors in 33 cancer types. In this homework, we are going to explore TCGA data analysis. 32 | 33 | ## 1. For both: 34 | 35 | Go to the TCGA GDC website (https://portal.gdc.cancer.gov/) and explore the GDC data portal. How many glioblastoma (GBM) cases in TCGA meet ALL of the following requirements? 36 | 1. Male; 37 | 2. Diagnosed above the age of 45; 38 | 3. Still alive. 39 | 40 | 41 | ## 2. For both: 42 | 43 | TCGA GDC (https://portal.gdc.cancer.gov/) and Broad Firehose (http://firebrowse.org/) both provide processed TCGA data for downloading and downstream analysis. Download the clinical data for GBM. What is the average age at diagnosis of all GBM patients? 44 | 45 | ```{r} 46 | # your code here 47 | 48 | ``` 49 | 50 | 51 | 52 | # Part II. Tumor Subtypes 53 | 54 | You are given a number of TCGA glioblastoma (GBM) samples and 10 commercially available normal brains (it is unethical to take matched normal brain from GBM tumor patients), including their expression, DNA methylation, and mutation profiles as well as patient survival. Please note that we selected only a subset of the samples for this HW and simplified the data to give students a flavor of cancer genomics studies, so some findings from these data might not reflect the real biology of GBM. 55 | 56 | 57 | ## 1. For both: 58 | 59 | GBM is one of the earliest cancer types to be processed by TCGA, and the expression profiling was initially done with Affymetrix microarrays. Also, with brain cancer, it is hard to get a sufficient number of normal samples. We provide the pre-processed expression matrix in GBM_expr.txt, where samples are columns and genes are rows. Do K-means (k=3) clustering twice: once using all the genes and once using the 2000 most variable genes. Do tumor and normal samples separate in different clusters? 
Do the tumor samples consistently separate into 2 clusters, regardless of whether you use all the genes or only the most variable genes? 60 | 61 | ```{r} 62 | # your code here 63 | 64 | ``` 65 | 66 | ## 2. For both: 67 | 68 | LIMMA is a Bioconductor package that performs differential expression analysis for microarray and RNA-seq data and can remove batch effects (especially if you have an experimental design with complex batches). Use LIMMA to determine how many genes are differentially expressed between the two GBM subtypes (with FDR < 0.05 and logFC > 1.5). 69 | 70 | ```{r} 71 | # your code here 72 | 73 | ``` 74 | 75 | ## 3. For GRADUATE students: 76 | 77 | From the DNA methylation profiles (GBM_meth.txt), what are the genes significantly differentially methylated between the two subtypes? Is DNA methylation associated with higher or lower expression of these genes? How many differentially expressed genes have an epigenetic (DNA methylation) cause (i.e. how many differentially expressed genes are also differentially methylated)? 78 | 79 | ```{r} 80 | # your code here 81 | 82 | ``` 83 | 84 | ## 4. For both: 85 | 86 | With the survival data of the GBM tumors (GBM_clin.txt), make a Kaplan-Meier curve to compare the two subtypes of GBM patients. Is there a significant difference in patient outcome between the two subtypes? 87 | 88 | ```{r} 89 | # your code here 90 | 91 | ``` 92 | 93 | ## 5. For GRADUATE students: 94 | 95 | Use the differential genes (say this is Y number of genes) between the two GBM subtypes as a gene signature to do a Cox regression of the tumor samples. Does it significantly predict patient outcome? 96 | 97 | ```{r} 98 | # your code here 99 | 100 | ``` 101 | 102 | ## 6. For GRADUATE students: 103 | 104 | Many studies use gene signatures to predict prognosis of patients. Take a look at this paper: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002240. 
105 | It turns out that most published gene signatures are not significantly more associated with outcome than random predictors. 106 | 107 | Write a script to randomly sample Y genes in this expression data as a gene signature and do Cox regression on the sampled signature to predict patient outcome. Automate the script to repeat the random sampling followed by Cox regression 100 times. How does your signature in Q5 compare to random signatures in predicting outcome? 108 | 109 | ```{r} 110 | # your code here 111 | 112 | ``` 113 | 114 | # Part III. Tumor mutation analyses and precision medicine 115 | 116 | ## 1. For both: 117 | 118 | The MAF files contain the mutations of each tumor compared to the normal DNA in the patient's blood. Write a script to parse out the mutations present in each tumor sample and write out a table. The table should rank the genes by how many times they are mutated in the tumor samples provided. Submit the table with the top 20 genes. 119 | 120 | ```{r} 121 | # your code here 122 | 123 | ``` 124 | 125 | ## 2. For both: 126 | 127 | Existing clinical genetic testing laboratories use information about the frequency of a mutation in cohorts, such as the GBM cohort in TCGA, to assess a mutation’s clinical significance (guidelines: https://www.ncbi.nlm.nih.gov/pubmed/27993330). Of the top 20 genes in Q1, what gene has the mutation seen the most times (hint: count mutations with the exact same amino acid change as the same)? Do you think this mutation forms a genetic subtype of GBM? 128 | 129 | ```{r} 130 | # your code here 131 | 132 | ``` 133 | 134 | ## 3. For both: 135 | 136 | cBioPortal has a comprehensive list of tumor profiling results for interactive visualization. Go to cBioPortal (http://www.cbioportal.org), and select either “Glioblastoma” under “CNS/Brain” (left) or “TCGA PanCancer Atlas Studies” under “Quick Select” (middle). Input each gene in Q1 and click Submit. 
From the OncoPrint tab, you can see how often each gene is mutated in GBM or all TCGA cancer types. Based on this, which of the genes in Part III Q1 is likely to be a cancer driver gene? 137 | 138 | ## 4. For both: 139 | 140 | From the Mutation tab on the cBioPortal result page, is this mutation a gain- or loss-of-function mutation in the gene you identified from Part III Q2? 141 | 142 | ## 5. For both: 143 | 144 | From cBioPortal, select Glioblastoma (TCGA provisional, which has the largest number of samples) and enter the driver mutation gene in Q2. From the Survival tab, do GBM patients with this mutation have better outcomes in terms of progression-free survival and overall survival? 145 | 146 | ## 6. For both: 147 | 148 | You are working with an oncologist collaborator to decide the treatment option for a GBM patient. From exome-seq of the tumor, you identified the top mutation in Part III Q2. To find out whether there are drugs that can target this mutation to treat the cancer, go to https://www.clinicaltrials.gov to find clinical trials that target the gene in Q2. How many trials are related to glioblastoma? How many of these are actively recruiting, so that this patient could potentially join? 149 | Hint: Search by the disease name and gene name. The file containing all the trials can be exported as a .tsv file. 150 | 151 | ```{r} 152 | # your code here 153 | 154 | ``` 155 | 156 | 157 | # Part IV. CRISPR screens 158 | 159 | We will learn to analyze CRISPR screen data from this paper: https://www.ncbi.nlm.nih.gov/pubmed/?term=26673326. To identify therapeutic targets for glioblastoma (GBM), the authors performed genome-wide CRISPR-Cas9 knockout (KO) screens in a patient-derived GBM stem-like cell line (GSCs0131). 160 | 161 | MAGeCK tutorial: 162 | https://sourceforge.net/p/mageck/wiki/Home/ 163 | https://sourceforge.net/projects/mageck/ 164 | 165 | The data for the CRISPR screen is stored at /n/stat115/2020/HW6/crispr_data. 
There are 4 gzipped fastq files (ending in fastq.gz) which store the data, and a library.csv library file for the sgRNAs. 166 | 167 | ## 1. For both: 168 | 169 | Use MAGeCK to do a basic QC of the CRISPR screen data (e.g. read mapping, ribosomal gene selection, replicate consistency). 170 | 171 | ```{r} 172 | # your code here 173 | 174 | ``` 175 | 176 | ## 2. For both: 177 | 178 | Analyze the CRISPR screen data with MAGeCK to identify positively and negatively selected genes. How many genes are positively or negatively selected, respectively, and what are their respective enriched pathways? 179 | 180 | ```{r} 181 | # your code here 182 | 183 | ``` 184 | 185 | ## 3. For GRADUATE students: 186 | 187 | Genes negatively selected in this CRISPR screen could be potential drug targets. However, if they are negatively selected in many cell types, targeting such genes might be too toxic to normal cells. Go to DepMap (DepMap.org), which has CRISPR / RNAi screens of over 500 human cell lines, and click “Tools” -> Data Explorer. Pick the top 3 negatively selected genes to explore. Select Gene Dependency from CRISPR (Avana) on the X axis and Omics from Expression on the Y axis to see the relationship between the expression level of the gene and dependency (CRISPR screen selection) of the gene across ~500 cell lines. Are the top 3 genes good drug targets? 188 | 189 | ```{r} 190 | # your code here 191 | 192 | ``` 193 | 194 | ## 4. For GRADUATE students: 195 | 196 | Let’s filter out pan-essential genes (PanEssential.txt) from the negatively selected genes in Q2. Take the remaining top 10 genes, and check whether those genes have drugs or are druggable using this website: http://www.oasis-genomics.org/. Go to Analysis -> Pan Cancer Report, enter the top 10 genes and check the table for druggability (a higher number on Dr means more druggable). Which of these genes are druggable? 197 | 198 | PanEssential.txt is stored at /n/stat115/2020/HW6/crispr_data. 
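
The filtering step described above can be sketched roughly as follows. This is a minimal illustration on toy data, not a reference solution: the gene names, the simplified column names `neg_rank` and `neg_fdr`, and the FDR cutoff of 0.05 are all assumptions made for the example. In an actual MAGeCK `gene_summary.txt` the corresponding columns are named `neg|rank` and `neg|fdr`, and the layout of PanEssential.txt is not shown in this document, so check it before reading it in.

```r
# Toy stand-in for a MAGeCK gene summary (assumption: simplified column
# names; with real output, read the file with
# read.delim(..., check.names = FALSE) and use `neg|rank` / `neg|fdr`).
gene_summary <- data.frame(
  id       = c("RPL3", "GENE_A", "POLR2A", "GENE_B", "GENE_C"),
  neg_rank = 1:5,
  neg_fdr  = c(0.001, 0.002, 0.003, 0.01, 0.2),
  stringsAsFactors = FALSE
)

# Toy stand-in for the gene list in PanEssential.txt (hypothetical content).
pan_essential <- c("RPL3", "POLR2A")

# Keep negatively selected genes (illustrative cutoff: FDR < 0.05).
neg_hits <- gene_summary[gene_summary$neg_fdr < 0.05, ]

# Drop pan-essential genes, then take the top 10 by negative-selection rank.
candidates <- neg_hits[!neg_hits$id %in% pan_essential, ]
candidates <- candidates[order(candidates$neg_rank), ]
top_candidates <- head(candidates$id, 10)

print(top_candidates)
```

Filtering before taking the top 10 (rather than after) keeps the final list at 10 genes whenever enough non-essential hits exist.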
199 | 200 | ```{r} 201 | # your code here 202 | 203 | ``` 204 | 205 | # Part V. Cancer immunology and immunotherapy 206 | 207 | Immune checkpoint inhibitors, which primarily activate CD8 T cells, have shown remarkable efficacy in melanoma (SKCM), but haven’t worked as well in GBM patients. Let’s explore the tumor immune microenvironment from TCGA data. Although the cancer patients in TCGA were not treated with immunotherapy, their response to other drugs and clinical outcome might be influenced by the pre-treatment tumor immune microenvironment. 208 | 209 | ## 1. For both: 210 | 211 | TIMER (http://timer.cistrome.org/) estimated the infiltration level of different immune cells of TCGA tumors using different immune deconvolution methods. CD8A and CD8B are two gene markers on CD8 T cells. On the Diff Exp tab, compare the expression level of either CD8A or CD8B between GBM and SKCM (Metastatic Melanoma). Based on this, which cancer type has more CD8 T cells? 212 | 213 | ## 2. For both: 214 | 215 | On the Gene tab, select both GBM and SKCM (Metastatic Melanoma) and include all 6 immune cell infiltrates. Check the following genes, PD1, PDL1, and CTLA4, which are the targets of immune checkpoint inhibitors, to see whether their expression level is associated with immune cell infiltration in the GBM and SKCM tumors. Their higher expression usually indicates that T cells are in a dysfunctional state, which immune checkpoint inhibitors aim to revive. 216 | 217 | ## 3. For both: 218 | 219 | On the Survival tab, select both GBM and SKCM, include all 6 immune cell infiltrates, and add tumor stage and patient age as the clinical variables to conduct survival analyses. Based on the Cox PH model, what factors are the most significantly associated with patient survival in each cancer type? Plot the Kaplan-Meier curve to evaluate how each immune cell infiltrate is associated with survival. Which cells are associated with patient survival in which cancer type? 220 | 221 | ## 4. 
For GRADUATE students: 222 | 223 | Based on the above observations, can you hypothesize why immune checkpoint inhibitors don’t work well for some GBM patients? 224 | 225 | # Rules for submitting the homework: 226 | 227 | Please submit your solution directly on the canvas website. Please 228 | provide both your code in this Rmd document and an html file for your 229 | final write-up. Please pay attention to the clarity and cleanness of 230 | your homework. 231 | 232 | The teaching fellows will grade your homework and give the grades with 233 | feedback through canvas within one week after the due date. Some of the 234 | questions might not have a unique or optimal solution. TFs will grade 235 | those according to your creativity and effort on exploration, especially 236 | in the graduate-level questions. 237 | 238 | -------------------------------------------------------------------------------- /HW6/data/GBM_clin.txt: -------------------------------------------------------------------------------- 1 | vital.status days.to.death days.to.last.followup 2 | TCGA-02-0025 1 1300 1300 3 | TCGA-02-0026 1 748 748 4 | TCGA-02-0080 1 2729 2729 5 | TCGA-02-0084 1 384 7 6 | TCGA-02-0085 1 1561 1561 7 | TCGA-02-0087 0 NA 1757 8 | TCGA-02-0104 1 1977 1977 9 | TCGA-02-0114 1 3041 3041 10 | TCGA-02-0116 1 1489 1489 11 | TCGA-02-0258 1 503 503 12 | TCGA-02-2483 0 NA 466 13 | TCGA-06-0124 1 620 123 14 | TCGA-06-0128 1 691 691 15 | TCGA-06-0129 1 1024 989 16 | TCGA-06-0146 1 611 611 17 | TCGA-06-0164 1 1731 1730 18 | TCGA-06-0194 1 142 142 19 | TCGA-06-0201 1 12 12 20 | TCGA-06-0210 1 225 151 21 | TCGA-06-0397 1 274 168 22 | TCGA-06-1805 0 NA 1031 23 | TCGA-06-2570 0 NA 958 24 | TCGA-06-5410 1 108 108 25 | TCGA-06-5412 1 138 138 26 | TCGA-06-5417 0 NA 155 27 | TCGA-06-6389 0 NA 237 28 | TCGA-08-0344 1 3524 3310 29 | TCGA-08-0346 1 256 132 30 | TCGA-08-0350 1 889 104 31 | TCGA-08-0373 1 134 134 32 | TCGA-08-0509 1 382 17 33 | TCGA-08-0510 1 130 107 34 | TCGA-12-0620 1 318 181 35 | 
TCGA-12-0772 1 1638 1615 36 | TCGA-12-0775 1 232 167 37 | TCGA-12-0818 1 2791 2791 38 | TCGA-12-0827 1 1179 1179 39 | TCGA-12-1090 1 231 190 40 | TCGA-14-0783 1 189 189 41 | TCGA-14-1456 0 NA 1246 42 | TCGA-14-1821 1 541 541 43 | TCGA-14-4157 0 NA 104 44 | TCGA-16-0849 0 NA 793 45 | TCGA-16-0850 1 498 496 46 | TCGA-16-1060 1 278 111 47 | TCGA-16-1460 0 NA 195 48 | TCGA-19-0962 1 20 13 49 | TCGA-19-1389 1 141 141 50 | TCGA-19-1790 1 154 154 51 | TCGA-19-2629 1 737 501 52 | TCGA-26-1442 0 NA 953 53 | TCGA-26-5133 0 NA 452 54 | TCGA-27-2521 0 NA 316 55 | TCGA-28-1756 0 NA 86 56 | TCGA-28-5209 0 NA 442 57 | TCGA-28-5218 1 157 128 58 | TCGA-32-4208 0 NA 643 59 | TCGA-32-4209 1 618 604 60 | TCGA-32-4213 0 NA 604 61 | TCGA-41-3393 1 135 135 62 | -------------------------------------------------------------------------------- /HW6/data/TCGA-06-5410.maf.txt: -------------------------------------------------------------------------------- 1 | Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_position End_position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_file Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID Genome_Change Annotation_Transcript Transcript_Strand Transcript_Exon Transcript_Position cDNA_Change Codon_Change Protein_Change Other_Transcripts Refseq_mRNA_Id Refseq_prot_Id SwissProt_acc_Id SwissProt_entry_Id Description UniProt_AApos UniProt_Region UniProt_Site UniProt_Natural_Variations UniProt_Experimental_Info GO_Biological_Process GO_Cellular_Component GO_Molecular_Function COSMIC_overlapping_mutations COSMIC_fusion_genes COSMIC_tissue_types_affected 
COSMIC_total_alterations_in_gene Tumorscape_Amplification_Peaks Tumorscape_Deletion_Peaks TCGAscape_Amplification_Peaks TCGAscape_Deletion_Peaks DrugBank ref_context gc_content CCLE_ONCOMAP_overlapping_mutations CCLE_ONCOMAP_total_mutations_in_gene CGC_Mutation_Type CGC_Translocation_Partner CGC_Tumor_Types_Somatic CGC_Tumor_Types_Germline CGC_Other_Diseases DNARepairGenes_Role FamilialCancerDatabase_Syndromes MUTSIG_Published_Results OREGANNO_ID OREGANNO_Values 2 | SPTA1 6708 broad.mit.edu 37 1 158592861 158592861 + Missense_Mutation SNP G A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr1:158592861G>A uc001fst.1 - 43 6231 c.6032C>T c.(6031-6033)GCC>GTC p.A2011V NM_003126 NP_003117 P02549 SPTA1_HUMAN spectrin, alpha, erythrocytic 1 2011 Spectrin 19. actin filament capping|actin filament organization|axon guidance|regulation of cell shape cytosol|intrinsic to internal side of plasma membrane|spectrin|spectrin-associated cytoskeleton actin filament binding|calcium ion binding|structural constituent of cytoskeleton ovary(4)|skin(2)|upper_aerodigestive_tract(1)|breast(1) 8 all_hematologic(112;0.0378) CAGCAGAGCGGCATAACGCTC 0.483 3 | C1orf112 55732 broad.mit.edu 37 1 169772375 169772375 + Silent SNP C T T TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr1:169772375C>T uc001ggp.2 + 6 547 c.237C>T c.(235-237)TCC>TCT p.S79S C1orf112_uc001ggj.2_RNA|C1orf112_uc001ggo.2_Silent_p.S79S|C1orf112_uc001ggq.2_Silent_p.S79S|C1orf112_uc009wvt.2_5'UTR|C1orf112_uc010plu.1_Silent_p.S50S|C1orf112_uc009wvu.1_Silent_p.S50S|C1orf112_uc001ggr.2_5'UTR|C1orf112_uc010plv.1_Silent_p.S21S NM_018186 NP_060656 Q9NSG2 CA112_HUMAN hypothetical protein LOC55732 79 0 all_hematologic(923;0.0922)|Acute lymphoblastic leukemia(37;0.181) 
CACAGGAATCCATCATTTTGG 0.363 4 | DLG5 9231 broad.mit.edu 37 10 79566617 79566617 + Silent SNP C A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr10:79566617C>A uc001jzk.2 - 26 4936 c.4866G>T c.(4864-4866)GTG>GTT p.V1622V DLG5_uc001jzi.2_Silent_p.V377V|DLG5_uc001jzj.2_Silent_p.V1037V|DLG5_uc009xru.1_RNA NM_004747 NP_004738 Q8TDM6 DLG5_HUMAN discs large homolog 5 1622 SH3. cell-cell adhesion|intracellular signal transduction|negative regulation of cell proliferation|regulation of apoptosis cell junction|cytoplasm beta-catenin binding|cytoskeletal protein binding|receptor signaling complex scaffold activity ovary(5)|breast(3) 8 all_cancers(46;0.0316)|all_epithelial(25;0.00147)|Breast(12;0.0015)|Prostate(51;0.0146) Epithelial(14;0.00105)|OV - Ovarian serous cystadenocarcinoma(4;0.00151)|all cancers(16;0.00446) AGGTGTCATCCACGTAGAGGA 0.572 5 | OR8H2 390151 broad.mit.edu 37 11 55873242 55873242 + Missense_Mutation SNP G A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr11:55873242G>A uc010riy.1 + 1 724 c.724G>A c.(724-726)GTC>ATC p.V242I NM_001005200 NP_001005200 Q8N162 OR8H2_HUMAN olfactory receptor, family 8, subfamily H, 242 Helical; Name=6; (Potential). 
sensory perception of smell integral to membrane|plasma membrane olfactory receptor activity ovary(1)|skin(1) 2 Esophageal squamous(21;0.00693) CTCTACTTGCGTCTCTCATCT 0.383 HNSCC(53;0.14) 6 | GLYATL2 219970 broad.mit.edu 37 11 58602091 58602091 + Silent SNP G A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr11:58602091G>A uc001nnd.3 - 6 827 c.696C>T c.(694-696)TAC>TAT p.Y232Y GLYATL2_uc009ymq.2_Silent_p.Y232Y NM_145016 NP_659453 Q8WU03 GLYL2_HUMAN glycine-N-acyltransferase-like 2 232 mitochondrion glycine N-acyltransferase activity ovary(1)|skin(1) 2 Breast(21;0.0044)|all_epithelial(135;0.0216) Glycine(DB00145) CTTGGTGTCTGTATTTGGGGA 0.413 7 | CTTN 2017 broad.mit.edu 37 11 70279266 70279266 + Silent SNP G A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr11:70279266G>A uc001opv.3 + 16 1532 c.1326G>A c.(1324-1326)CCG>CCA p.P442P CTTN_uc001opu.2_Silent_p.P405P|CTTN_uc001opw.3_Silent_p.P405P|CTTN_uc010rqm.1_Silent_p.P126P|CTTN_uc001opx.2_Silent_p.P126P NM_005231 NP_005222 Q14247 SRC8_HUMAN cortactin isoform a 442 cell cortex|cytoskeleton|lamellipodium|ruffle|soluble fraction protein binding ovary(1) 1 BRCA - Breast invasive adenocarcinoma(2;4.34e-41)|LUSC - Lung squamous cell carcinoma(11;1.51e-13)|STAD - Stomach adenocarcinoma(18;0.0513) Lung(977;0.0234)|LUSC - Lung squamous cell carcinoma(976;0.133) GGACGGAGCCGGAGCCCGTGT 0.652 8 | DYNC2H1 79659 broad.mit.edu 37 11 103014114 103014114 + Nonsense_Mutation SNP C T T TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr11:103014114C>T uc001pho.2 + 18 2836 c.2692C>T c.(2692-2694)CGA>TGA p.R898* 
DYNC2H1_uc001phn.1_Nonsense_Mutation_p.R898*|DYNC2H1_uc009yxe.1_Intron NM_001080463 NP_001073932 Q8NCM8 DYHC2_HUMAN dynein, cytoplasmic 2, heavy chain 1 898 Stem (By similarity). cell projection organization|Golgi organization|microtubule-based movement|multicellular organismal development cilium axoneme|dynein complex|Golgi apparatus|microtubule|plasma membrane ATP binding|ATPase activity|microtubule motor activity 0 Acute lymphoblastic leukemia(157;0.000966)|all_hematologic(158;0.00348) BRCA - Breast invasive adenocarcinoma(274;0.000177)|Epithelial(105;0.0785) AGAAGTAGAACGACTTCCAAG 0.363 9 | BCL2L14 79370 broad.mit.edu 37 12 12232401 12232401 + Silent SNP C T T TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr12:12232401C>T uc001rac.2 + 2 363 c.162C>T c.(160-162)TCC>TCT p.S54S ETV6_uc001raa.1_Intron|BCL2L14_uc001raf.1_RNA|BCL2L14_uc001rad.2_Silent_p.S54S|BCL2L14_uc001rae.2_Silent_p.S54S NM_138723 NP_620049 Q9BZR8 B2L14_HUMAN BCL2-like 14 isoform 1 54 apoptosis|regulation of apoptosis cytosol|endomembrane system|intracellular organelle|membrane protein binding p.S54S(1) skin(1) 1 Prostate(47;0.0872) BRCA - Breast invasive adenocarcinoma(232;0.154) GAAGTTTGTCCCAGAGGGGCC 0.488 10 | LIMA1 51474 broad.mit.edu 37 12 50575756 50575756 + Missense_Mutation SNP C T T TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr12:50575756C>T uc001rwj.3 - 10 1379 c.1205G>A c.(1204-1206)CGT>CAT p.R402H LIMA1_uc001rwg.3_Missense_Mutation_p.R100H|LIMA1_uc001rwh.3_Missense_Mutation_p.R241H|LIMA1_uc001rwi.3_Missense_Mutation_p.R243H|LIMA1_uc001rwk.3_Missense_Mutation_p.R403H|LIMA1_uc010smr.1_RNA|LIMA1_uc010sms.1_RNA NM_016357 NP_057441 Q9UHB6 LIMA1_HUMAN LIM domain and actin binding 1 isoform b 402 LIM zinc-binding. 
actin filament bundle assembly|negative regulation of actin filament depolymerization|ruffle organization cytoplasm|focal adhesion|stress fiber actin filament binding|actin monomer binding|zinc ion binding ovary(1) 1 GGCCAAGAGACGCTCCATTGG 0.473 11 | DGKA 1606 broad.mit.edu 37 12 56330335 56330335 + Silent SNP G A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr12:56330335G>A uc001sij.2 + 2 312 c.48G>A c.(46-48)CTG>CTA p.L16L DGKA_uc009zoc.1_Silent_p.L16L|DGKA_uc001sih.1_5'UTR|DGKA_uc001sii.1_5'UTR|DGKA_uc009zod.1_Silent_p.L16L|DGKA_uc009zoe.1_Silent_p.L16L|DGKA_uc001sik.2_Silent_p.L16L|DGKA_uc001sil.2_Silent_p.L16L|DGKA_uc001sim.2_Silent_p.L16L|DGKA_uc001sin.2_Silent_p.L16L|DGKA_uc009zof.2_5'UTR|DGKA_uc001sio.2_5'UTR NM_001345 NP_001336 P23743 DGKA_HUMAN diacylglycerol kinase, alpha 80kDa 16 activation of protein kinase C activity by G-protein coupled receptor protein signaling pathway|intracellular signal transduction|platelet activation plasma membrane ATP binding|calcium ion binding|diacylglycerol kinase activity ovary(3)|pancreas(1) 4 Vitamin E(DB00163) TTGCCCAGCTGCAAAAATACA 0.527 12 | FREM2 341640 broad.mit.edu 37 13 39266205 39266205 + Missense_Mutation SNP T G G TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr13:39266205T>G uc001uwv.2 + 1 5033 c.4724T>G c.(4723-4725)GTG>GGG p.V1575G NM_207361 NP_997244 Q5SZK8 FREM2_HUMAN FRAS1-related extracellular matrix protein 2 1575 Extracellular (Potential).|CSPG 11. 
cell communication|homophilic cell adhesion|multicellular organismal development integral to membrane|plasma membrane calcium ion binding ovary(7)|pancreas(1)|haematopoietic_and_lymphoid_tissue(1)|central_nervous_system(1)|skin(1) 11 Lung NSC(96;1.04e-07)|Prostate(109;0.00384)|Breast(139;0.00396)|Lung SC(185;0.0565)|Hepatocellular(188;0.114) all cancers(112;3.32e-07)|Epithelial(112;1.66e-05)|OV - Ovarian serous cystadenocarcinoma(117;0.00154)|BRCA - Breast invasive adenocarcinoma(63;0.00631)|GBM - Glioblastoma multiforme(144;0.0312) ATCACCCAGGTGCCTATTCAT 0.418 13 | CHD8 57680 broad.mit.edu 37 14 21871325 21871325 + Nonsense_Mutation SNP G A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr14:21871325G>A uc001was.1 - 18 2822 c.2728C>T c.(2728-2730)CAG>TAG p.Q910* CHD8_uc001war.1_Nonsense_Mutation_p.Q806*|CHD8_uc001wav.1_Nonsense_Mutation_p.Q352* NM_020920 NP_065971 Q9HCK8 CHD8_HUMAN chromodomain helicase DNA binding protein 8 1189 Helicase C-terminal. 
ATP-dependent chromatin remodeling|canonical Wnt receptor signaling pathway|negative regulation of transcription, DNA-dependent|negative regulation of Wnt receptor signaling pathway|positive regulation of transcription from RNA polymerase II promoter|positive regulation of transcription from RNA polymerase III promoter|transcription, DNA-dependent MLL1 complex ATP binding|beta-catenin binding|DNA binding|DNA helicase activity|DNA-dependent ATPase activity|methylated histone residue binding|p53 binding ovary(6)|upper_aerodigestive_tract(1)|large_intestine(1)|breast(1)|skin(1) 10 all_cancers(95;0.00121) Epithelial(56;2.55e-06)|all cancers(55;1.73e-05) GBM - Glioblastoma multiforme(265;0.00424) ATGGCAGCCTGTCGAAGGTTG 0.478 14 | LRFN5 145581 broad.mit.edu 37 14 42360496 42360496 + Missense_Mutation SNP G C C TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr14:42360496G>C uc001wvm.2 + 4 2627 c.1429G>C c.(1429-1431)GCT>CCT p.A477P LRFN5_uc010ana.2_Intron NM_152447 NP_689660 Q96NI6 LRFN5_HUMAN leucine rich repeat and fibronectin type III 477 Extracellular (Potential).|Fibronectin type-III. 
integral to membrane ovary(5)|pancreas(2)|central_nervous_system(1) 8 LUAD - Lung adenocarcinoma(50;0.0223)|Lung(238;0.0728) GBM - Glioblastoma multiforme(112;0.00847) CAATAATCTGGCTGCTGGAAC 0.403 HNSCC(30;0.082) 15 | IL32 9235 broad.mit.edu 37 16 3119304 3119305 + Frame_Shift_Ins INS - G G rs2981599 TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr16:3119304_3119305insG uc002cto.2 + 6 864_865 c.653_654insG c.(652-654)GACfs p.D218fs IL32_uc002ctk.2_Frame_Shift_Ins_p.D115fs|IL32_uc010uwp.1_Frame_Shift_Ins_p.D152fs|IL32_uc010btb.2_Frame_Shift_Ins_p.D162fs|IL32_uc002ctl.2_Frame_Shift_Ins_p.D172fs|IL32_uc002ctm.2_Frame_Shift_Ins_p.D172fs|IL32_uc002ctn.2_Frame_Shift_Ins_p.D172fs|IL32_uc002cts.3_Frame_Shift_Ins_p.D172fs|IL32_uc002ctp.2_Frame_Shift_Ins_p.D152fs|IL32_uc002ctq.2_Frame_Shift_Ins_p.D218fs|IL32_uc002ctr.2_Frame_Shift_Ins_p.D152fs|IL32_uc002ctt.2_Frame_Shift_Ins_p.D172fs|IL32_uc010uwr.1_Frame_Shift_Ins_p.D132fs|IL32_uc002ctu.2_Frame_Shift_Ins_p.D163fs NM_004221 NP_004212 P24001 IL32_HUMAN interleukin 32 isoform B 218 cell adhesion|defense response|immune response extracellular space cytokine activity pancreas(1) 1 CCACGGGGGGACAAGGAGGAGC 0.574 16 | ZNF263 10127 broad.mit.edu 37 16 3339555 3339555 + Missense_Mutation SNP A G G TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr16:3339555A>G uc002cuq.2 + 6 1381 c.1049A>G c.(1048-1050)GAG>GGG p.E350G ZNF263_uc010uww.1_5'UTR|ZNF263_uc002cur.2_5'UTR NM_005741 NP_005732 O14978 ZN263_HUMAN zinc finger protein 263 350 viral reproduction nucleus DNA binding|sequence-specific DNA binding transcription factor activity|zinc ion binding skin(3)|ovary(1) 4 CCTCCCCCAGAGGGTGGAATG 0.617 17 | ADCY9 115 broad.mit.edu 37 16 4016471 4016471 + Missense_Mutation SNP C T T 
TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr16:4016471C>T uc002cvx.2 - 11 3906 c.3367G>A c.(3367-3369)GCG>ACG p.A1123T NM_001116 NP_001107 O60503 ADCY9_HUMAN adenylate cyclase 9 1123 Guanylate cyclase 2.|Cytoplasmic (Potential). activation of adenylate cyclase activity by G-protein signaling pathway|activation of phospholipase C activity|activation of protein kinase A activity|cellular response to glucagon stimulus|energy reserve metabolic process|inhibition of adenylate cyclase activity by G-protein signaling pathway|nerve growth factor receptor signaling pathway|synaptic transmission|transmembrane transport|water transport integral to plasma membrane adenylate cyclase activity|ATP binding|metal ion binding ovary(4)|large_intestine(1)|central_nervous_system(1) 6 TGGGCCTGCGCGGTGTTCAGC 0.602 18 | RRN3P1 730092 broad.mit.edu 37 16 21817457 21817457 + Silent SNP G A A rs150520281 by1000genomes TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr16:21817457G>A uc010vbl.1 - 7 603 c.106C>T c.(106-108)CTG>TTG p.L36L uc002diq.3_Intron NR_003370 SubName: Full=Putative uncharacterized protein ENSP00000219758; 0 CTTACATCCAGCTTGAGTAGT 0.254 19 | TERF2IP 54386 broad.mit.edu 37 16 75690204 75690206 + In_Frame_Del DEL GAA - - rs140846731 TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr16:75690204_75690206delGAA uc002fet.1 + 3 992_994 c.895_897delGAA c.(895-897)GAAdel p.E304del NM_018975 NP_061848 Q9NYB0 TE2IP_HUMAN telomeric repeat binding factor 2, interacting 304 Asp/Glu-rich (acidic). 
negative regulation of DNA recombination at telomere|negative regulation of telomere maintenance|positive regulation of I-kappaB kinase/NF-kappaB cascade|positive regulation of NF-kappaB transcription factor activity|protection from non-homologous end joining at telomere|protein localization to chromosome, telomeric region|regulation of double-strand break repair via homologous recombination|telomere maintenance via telomerase|transcription, DNA-dependent cytoplasm|nuclear telomere cap complex|nucleoplasm DNA binding|protein binding central_nervous_system(1) 1 TGATgaggaggaagaagaagaag 0.369 20 | NF1 4763 broad.mit.edu 37 17 29533304 29533304 + Nonsense_Mutation SNP C A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr17:29533304C>A uc002hgg.2 + 12 1640 c.1307C>A c.(1306-1308)TCG>TAG p.S436* NF1_uc002hge.1_Nonsense_Mutation_p.S436*|NF1_uc002hgf.1_Nonsense_Mutation_p.S436*|NF1_uc002hgh.2_Nonsense_Mutation_p.S436*|NF1_uc010csn.1_Nonsense_Mutation_p.S296* NM_001042492 NP_001035957 P21359 NF1_HUMAN neurofibromin isoform 1 436 actin cytoskeleton organization|adrenal gland development|artery morphogenesis|camera-type eye morphogenesis|cerebral cortex development|collagen fibril organization|forebrain astrocyte development|forebrain morphogenesis|heart development|liver development|MAPKKK cascade|metanephros development|myelination in peripheral nervous system|negative regulation of cell migration|negative regulation of endothelial cell proliferation|negative regulation of MAP kinase activity|negative regulation of MAPKKK cascade|negative regulation of neuroblast proliferation|negative regulation of oligodendrocyte differentiation|negative regulation of transcription factor import into nucleus|osteoblast differentiation|phosphatidylinositol 3-kinase cascade|pigmentation|positive regulation of adenylate cyclase activity|positive regulation of 
neuron apoptosis|Ras protein signal transduction|regulation of blood vessel endothelial cell migration|regulation of bone resorption|response to hypoxia|smooth muscle tissue development|spinal cord development|sympathetic nervous system development|visual learning|wound healing axon|cytoplasm|dendrite|intrinsic to internal side of plasma membrane|nucleus protein binding|Ras GTPase activator activity p.?(2) soft_tissue(159)|central_nervous_system(56)|lung(28)|large_intestine(27)|haematopoietic_and_lymphoid_tissue(18)|ovary(18)|autonomic_ganglia(12)|breast(3)|skin(3)|stomach(2)|thyroid(1)|prostate(1)|kidney(1)|pancreas(1) 330 all_cancers(10;1.29e-12)|all_epithelial(10;0.00347)|all_hematologic(16;0.00556)|Acute lymphoblastic leukemia(14;0.00593)|Breast(31;0.014)|Myeloproliferative disorder(56;0.0255)|all_lung(9;0.0321)|Lung NSC(157;0.0659) UCEC - Uterine corpus endometrioid carcinoma (4;4.38e-05)|all cancers(4;1.64e-26)|Epithelial(4;9.15e-23)|OV - Ovarian serous cystadenocarcinoma(4;3.58e-21)|GBM - Glioblastoma multiforme(4;0.00146) TATTGTCACTCGGTTGAACTT 0.413 D|Mis|N|F|S|O neurofibroma|glioma neurofibroma|glioma Neurofibromatosis_type_1 TCGA GBM(6;<1E-08)|TSP Lung(7;0.0071)|TCGA Ovarian(3;0.0088) 21 | CYP4F11 57834 broad.mit.edu 37 19 16034748 16034748 + Silent SNP G A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr19:16034748G>A uc002nbu.2 - 7 828 c.792C>T c.(790-792)CAC>CAT p.H264H CYP4F11_uc010eab.1_Silent_p.H264H|CYP4F11_uc002nbt.2_Silent_p.H264H NM_001128932 NP_001122404 Q9HBI6 CP4FB_HUMAN cytochrome P450 family 4 subfamily F polypeptide 264 inflammatory response|xenobiotic metabolic process endoplasmic reticulum membrane|integral to membrane|microsome aromatase activity|electron carrier activity|heme binding ovary(1) 1 CTGTGAAGTCGTGCACCAGGT 0.527 22 | USE1 55850 broad.mit.edu 37 19 17329200 17329200 + Missense_Mutation SNP C T T 
TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr19:17329200C>T uc002nfo.2 + 6 482 c.422C>T c.(421-423)ACT>ATT p.T141I USE1_uc002nfn.2_3'UTR|USE1_uc010eal.1_Missense_Mutation_p.T141I NM_018467 NP_060937 Q9NZ43 USE1_HUMAN unconventional SNARE in the ER 1 homolog 141 Cytoplasmic (Potential). lysosomal transport|protein catabolic process|protein transport|secretion by cell|vesicle-mediated transport endoplasmic reticulum membrane|integral to membrane protein binding 0 AGGAAGAGAACGTGAGTGTCT 0.582 23 | PSG1 5669 broad.mit.edu 37 19 43382389 43382389 + Missense_Mutation SNP C T T TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr19:43382389C>T uc002ovb.2 - 2 244 c.106G>A c.(106-108)GTC>ATC p.V36I PSG3_uc002ouf.2_Intron|PSG1_uc002oug.1_Missense_Mutation_p.V36I|PSG11_uc002ouw.2_Intron|PSG7_uc002ous.1_Intron|PSG7_uc002out.1_Intron|PSG10_uc002ouv.1_Intron|PSG1_uc002oun.2_RNA|PSG1_uc002our.1_Missense_Mutation_p.V36I|PSG1_uc010eio.1_Missense_Mutation_p.V36I|PSG1_uc002oux.1_5'UTR|PSG1_uc002ouy.1_Missense_Mutation_p.V36I|PSG1_uc002ouz.1_Missense_Mutation_p.V36I|PSG1_uc002ova.1_Missense_Mutation_p.V36I|PSG1_uc002ovc.2_Missense_Mutation_p.V36I|PSG1_uc002ovd.1_Missense_Mutation_p.V36I NM_006905 NP_008836 P11464 PSG1_HUMAN pregnancy specific beta-1-glycoprotein 1 36 Ig-like V-type. 
female pregnancy extracellular region ovary(2) 2 Prostate(69;0.00682) TCAATCGTGACTTGGGCAGTG 0.463 24 | CACNG6 59285 broad.mit.edu 37 19 54503003 54503003 + Silent SNP A G G TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr19:54503003A>G uc002qct.2 + 3 1112 c.522A>G c.(520-522)GGA>GGG p.G174G CACNG6_uc002qcu.2_Intron|CACNG6_uc002qcv.2_Intron NM_145814 NP_665813 Q9BXT2 CCG6_HUMAN voltage-dependent calcium channel gamma-6 174 Helical; (Potential). voltage-gated calcium channel complex voltage-gated calcium channel activity ovary(2) 2 all_cancers(19;0.0128)|all_epithelial(19;0.00564)|all_lung(19;0.031)|Lung NSC(19;0.0358)|Ovarian(34;0.19) GBM - Glioblastoma multiforme(134;0.168) TCCGAGTTGGAGCCGTCTGCT 0.587 25 | LY75 4065 broad.mit.edu 37 2 160755280 160755280 + Missense_Mutation SNP G A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr2:160755280G>A uc002ubc.3 - 2 454 c.385C>T c.(385-387)CAT>TAT p.H129Y LY75_uc002ubb.3_Missense_Mutation_p.H129Y|LY75_uc010fos.2_Missense_Mutation_p.H129Y|LY75_uc010fot.1_Missense_Mutation_p.H129Y NM_002349 NP_002340 O60449 LY75_HUMAN lymphocyte antigen 75 precursor 129 Extracellular (Potential).|Ricin B-type lectin. 
endocytosis|immune response|inflammatory response integral to plasma membrane receptor activity|sugar binding 0 COAD - Colon adenocarcinoma(177;0.132) GCTGTGCCATGTCCATCCTTC 0.522 26 | SYN3 8224 broad.mit.edu 37 22 32937634 32937634 + Silent SNP G A A rs148217218 TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr22:32937634G>A uc003amx.2 - 7 999 c.840C>T c.(838-840)TAC>TAT p.Y280Y SYN3_uc003amy.2_Silent_p.Y280Y|SYN3_uc003amz.2_Silent_p.Y279Y NM_003490 NP_003481 O14994 SYN3_HUMAN synapsin III isoform IIIa 280 C; actin-binding and synaptic-vesicle binding. neurotransmitter secretion cell junction|synaptic vesicle membrane ATP binding|ligase activity skin(1) 1 CGGTGGTGGCGTAGGTTTTGG 0.552 27 | SI 6476 broad.mit.edu 37 3 164786544 164786544 + Missense_Mutation SNP G T T TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr3:164786544G>T uc003fei.2 - 5 511 c.449C>A c.(448-450)ACT>AAT p.T150N NM_001041 NP_001032 P14410 SUIS_HUMAN sucrase-isomaltase 150 Lumenal.|Isomaltase. 
carbohydrate metabolic process|polysaccharide digestion apical plasma membrane|brush border|Golgi apparatus|integral to membrane carbohydrate binding|oligo-1,6-glucosidase activity|sucrose alpha-glucosidase activity ovary(7)|upper_aerodigestive_tract(4)|skin(2)|pancreas(1) 14 Prostate(884;0.00314)|Melanoma(1037;0.0153)|all_neural(597;0.0199) Acarbose(DB00284) CTGATTTTGAGTTGTGAAGAG 0.323 HNSCC(35;0.089) 28 | PYDC2 152138 broad.mit.edu 37 3 191179074 191179074 + Silent SNP C T T rs141891926 by1000genomes TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr3:191179074C>T uc011bso.1 + 1 123 c.123C>T c.(121-123)ACC>ACT p.T41T NM_001083308 NP_001076777 Q56P42 PYDC2_HUMAN pyrin domain containing 2 41 DAPIN. cytoplasm|nucleus 0 AGCTACAGACCGTCCCCCAGA 0.542 29 | KLHL5 51088 broad.mit.edu 37 4 39116788 39116788 + Silent SNP C G G TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr4:39116788C>G uc003gts.2 + 10 2124 c.2049C>G c.(2047-2049)CCC>CCG p.P683P KLHL5_uc003gtp.2_Silent_p.P637P|KLHL5_uc003gtq.2_Silent_p.P496P|KLHL5_uc003gtr.1_Silent_p.P683P|KLHL5_uc003gtt.2_Silent_p.P622P NM_015990 NP_057074 Q96PQ7 KLHL5_HUMAN kelch-like 5 isoform 1 683 Kelch 5. 
cytoplasm|cytoskeleton actin binding ovary(1) 1 GATATGATCCCAAAACAGACA 0.383 30 | GPRIN3 285513 broad.mit.edu 37 4 90170302 90170302 + Silent SNP C T T rs145721148 byFrequency TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr4:90170302C>T uc003hsm.1 - 2 1479 c.960G>A c.(958-960)GCG>GCA p.A320A NM_198281 NP_938022 Q6ZVF9 GRIN3_HUMAN G protein-regulated inducer of neurite outgrowth 320 ovary(3) 3 Hepatocellular(203;0.114) OV - Ovarian serous cystadenocarcinoma(123;5.67e-05) CCTGCACCTCCGCATCTTGCC 0.537 31 | HEATR7B2 133558 broad.mit.edu 37 5 41048449 41048449 + Missense_Mutation SNP G A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr5:41048449G>A uc003jmj.3 - 16 2151 c.1661C>T c.(1660-1662)CCT>CTT p.P554L HEATR7B2_uc003jmi.3_Missense_Mutation_p.P109L NM_173489 NP_775760 Q7Z745 HTRB2_HUMAN HEAT repeat family member 7B2 554 HEAT 6. 
binding ovary(6)|central_nervous_system(2) 8 CAGAAGCTCAGGTAAACGTGT 0.468 32 | KCTD16 57528 broad.mit.edu 37 5 143853547 143853547 + Missense_Mutation SNP A C C TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr5:143853547A>C uc003lnm.1 + 4 1786 c.1157A>C c.(1156-1158)AAA>ACA p.K386T KCTD16_uc003lnn.1_Missense_Mutation_p.K386T NM_020768 NP_065819 Q68DU8 KCD16_HUMAN potassium channel tetramerisation domain 386 cell junction|postsynaptic membrane|presynaptic membrane|voltage-gated potassium channel complex voltage-gated potassium channel activity large_intestine(2)|ovary(1)|skin(1) 4 all_hematologic(541;0.118) KIRC - Kidney renal clear cell carcinoma(527;0.00111)|Kidney(363;0.00176) AAAGCTGTTAAAGAAAAGCTC 0.443 33 | UNC5A 90249 broad.mit.edu 37 5 176301527 176301527 + Silent SNP C T T TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr5:176301527C>T uc003mey.2 + 8 1530 c.1338C>T c.(1336-1338)ACC>ACT p.T446T UNC5A_uc010jkg.1_Silent_p.T406T NM_133369 NP_588610 Q6ZN44 UNC5A_HUMAN netrin receptor Unc5h1 precursor 446 ZU5.|Cytoplasmic (Potential). 
apoptosis|axon guidance|regulation of apoptosis integral to membrane|plasma membrane skin(1) 1 all_cancers(89;0.000119)|Renal(175;0.000269)|Lung NSC(126;0.00696)|all_lung(126;0.0115) Medulloblastoma(196;0.00498)|all_neural(177;0.0138) Kidney(164;2.23e-05)|KIRC - Kidney renal clear cell carcinoma(164;0.000178) CCTATGGGACCTTCAACTTCC 0.627 34 | GRM3 2913 broad.mit.edu 37 7 86469103 86469103 + Missense_Mutation SNP C T T rs141671463 TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr7:86469103C>T uc003uid.2 + 4 3372 c.2273C>T c.(2272-2274)ACG>ATG p.T758M GRM3_uc010lef.2_Intron|GRM3_uc010leg.2_Missense_Mutation_p.T630M|GRM3_uc010leh.2_Missense_Mutation_p.T350M NM_000840 NP_000831 Q14832 GRM3_HUMAN glutamate receptor, metabotropic 3 precursor 758 Cytoplasmic (Potential). synaptic transmission integral to plasma membrane p.T758M(1) lung(4)|ovary(3)|central_nervous_system(2)|skin(2)|haematopoietic_and_lymphoid_tissue(1)|prostate(1) 13 Esophageal squamous(14;0.0058)|all_lung(186;0.132)|Lung NSC(181;0.142) Acamprosate(DB00659)|Nicotine(DB00184) GCCTTCAAAACGCGGAAGTGC 0.428 35 | CYP3A5 1577 broad.mit.edu 37 7 99262902 99262902 + Missense_Mutation SNP C G G TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr7:99262902C>G uc003urq.2 - 7 644 c.557G>C c.(556-558)GGC>GCC p.G186A ZNF498_uc003urn.2_Intron|CYP3A5_uc003urp.2_Missense_Mutation_p.G6A|CYP3A5_uc003urr.2_Missense_Mutation_p.G73A|CYP3A5_uc011kiy.1_Missense_Mutation_p.G176A|CYP3A5_uc003urs.2_Intron|CYP3A5_uc010lgg.2_Intron NM_000777 NP_000768 P20815 CP3A5_HUMAN cytochrome P450, family 3, subfamily A, 186 alkaloid catabolic process|drug catabolic process|oxidative demethylation|steroid metabolic process|xenobiotic metabolic process endoplasmic reticulum membrane|microsome aromatase 
activity|electron carrier activity|heme binding|oxygen binding 0 all_epithelial(64;2.77e-08)|Lung NSC(181;0.00396)|all_lung(186;0.00659)|Esophageal squamous(72;0.0166) Alfentanil(DB00802)|Clopidogrel(DB00758)|Cyclosporine(DB00091)|Daunorubicin(DB00694)|Indinavir(DB00224)|Irinotecan(DB00762)|Ketoconazole(DB01026)|Lapatinib(DB01259)|Mephenytoin(DB00532)|Midazolam(DB00683)|Mifepristone(DB00834)|Phenytoin(DB00252)|Quinine(DB00468)|Saquinavir(DB01232)|Tacrolimus(DB00864)|Troleandomycin(DB01361)|Verapamil(DB00661)|Vincristine(DB00541) AAATGATGTGCCAGTAATCAC 0.418 36 | PIP 5304 broad.mit.edu 37 7 142836647 142836647 + Missense_Mutation SNP G A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr7:142836647G>A uc003wcf.1 + 4 389 c.353G>A c.(352-354)CGG>CAG p.R118Q NM_002652 NP_002643 P12273 PIP_HUMAN prolactin-induced protein precursor 118 extracellular region actin binding ovary(1) 1 Melanoma(164;0.059) Ovarian(593;2.82e-05)|Breast(660;0.012) BRCA - Breast invasive adenocarcinoma(188;0.0026)|LUSC - Lung squamous cell carcinoma(290;0.0733)|Lung(243;0.08) GATGTTATTCGGGAATTAGGC 0.453 37 | DMRT3 58524 broad.mit.edu 37 9 990484 990484 + Missense_Mutation SNP G A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chr9:990484G>A uc003zgw.1 + 2 936 c.898G>A c.(898-900)GCA>ACA p.A300T NM_021240 NP_067063 Q9NQL9 DMRT3_HUMAN doublesex and mab-3 related transcription factor 300 cell differentiation|multicellular organismal development|sex differentiation nucleus DNA binding|metal ion binding|sequence-specific DNA binding transcription factor activity ovary(2)|central_nervous_system(1) 3 all_lung(10;1.39e-08)|Lung NSC(10;1.42e-08) Lung(218;0.0196) GCGAACTTCCGCAGAACCTGA 0.582 38 | ZBED1 9189 broad.mit.edu 37 X 2407462 2407462 + Silent SNP C 
T T TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chrX:2407462C>T uc004cqg.2 - 2 1500 c.1299G>A c.(1297-1299)ACG>ACA p.T433T DHRSX_uc004cqf.3_Intron|ZBED1_uc004cqh.1_Silent_p.T433T NM_004729 NP_004720 O96006 ZBED1_HUMAN zinc finger, BED-type containing 1 433 nuclear chromosome DNA binding|metal ion binding|protein dimerization activity|transposase activity 0 all_cancers(21;4.28e-07)|all_epithelial(21;2.07e-08)|all_lung(23;2.81e-05)|Lung NSC(23;0.000693)|Lung SC(21;0.122) TGATGTTGAGCGTGGTGTTCA 0.597 39 | RAI2 10742 broad.mit.edu 37 X 17818684 17818684 + Missense_Mutation SNP G C C TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chrX:17818684G>C uc004cyf.2 - 3 2017 c.1447C>G c.(1447-1449)CAA>GAA p.Q483E RAI2_uc004cyg.2_Missense_Mutation_p.Q483E|RAI2_uc010nfa.2_Missense_Mutation_p.Q483E|RAI2_uc004cyh.3_Missense_Mutation_p.Q483E|RAI2_uc011miy.1_Missense_Mutation_p.Q433E NM_021785 NP_068557 Q9Y5P3 RAI2_HUMAN retinoic acid induced 2 483 embryo development ovary(1)|breast(1) 2 Hepatocellular(33;0.183) TCTTCCCCTTGGCTGTTGATG 0.468 40 | EIF2S3 1968 broad.mit.edu 37 X 24073154 24073154 + Silent SNP G A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chrX:24073154G>A uc004dbc.2 + 1 90 c.69G>A c.(67-69)TTG>TTA p.L23L NM_001415 NP_001406 P41091 IF2G_HUMAN eukaryotic translation initiation factor 2, 23 cytosol GTP binding|GTPase activity|protein binding|translation initiation factor activity lung(1) 1 TCACCACCTTGGTGAGGTTTT 0.587 OREG0019714 type=REGULATORY REGION|TFbs=CTCF|Dataset=CTCF ChIP-chip sites (Ren lab)|EvidenceSubtype=ChIP-on-chip (ChIP-chip) 41 | PORCN 64840 broad.mit.edu 37 X 48368320 48368320 + 
Missense_Mutation SNP G A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chrX:48368320G>A uc010nie.1 + 2 270 c.112G>A c.(112-114)GCC>ACC p.A38T PORCN_uc004djq.1_Missense_Mutation_p.A151T|PORCN_uc004djr.1_Missense_Mutation_p.A38T|PORCN_uc004djs.1_Missense_Mutation_p.A38T|PORCN_uc004djt.1_5'UTR|PORCN_uc011mlx.1_5'UTR|PORCN_uc004dju.1_5'UTR|PORCN_uc004djv.1_Missense_Mutation_p.A38T|PORCN_uc004djw.1_Missense_Mutation_p.A38T NM_203475 NP_982301 Q9H237 PORCN_HUMAN porcupine isoform D 38 Helical; (Potential).|Leu-rich. Wnt receptor signaling pathway endoplasmic reticulum membrane|integral to membrane acyltransferase activity ovary(2)|central_nervous_system(1) 3 CATCTGCCTCGCCTGCCGCCT 0.413 42 | WNK3 65267 broad.mit.edu 37 X 54276526 54276526 + Nonsense_Mutation SNP G A A TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chrX:54276526G>A uc004dtd.1 - 16 3053 c.2614C>T c.(2614-2616)CGA>TGA p.R872* WNK3_uc004dtc.1_Nonsense_Mutation_p.R872* NM_001002838 NP_001002838 Q9BYP7 WNK3_HUMAN WNK lysine deficient protein kinase 3 isoform 2 872 intracellular protein kinase cascade|positive regulation of establishment of protein localization in plasma membrane|positive regulation of peptidyl-threonine phosphorylation|positive regulation of rubidium ion transmembrane transporter activity|positive regulation of rubidium ion transport|positive regulation of sodium ion transmembrane transporter activity|positive regulation of sodium ion transport|protein autophosphorylation adherens junction|tight junction ATP binding|protein binding|protein serine/threonine kinase activity|rubidium ion transmembrane transporter activity|sodium ion transmembrane transporter activity lung(4)|ovary(3)|kidney(2)|central_nervous_system(2) 11 ATACAGAATCGCCACCGACCA 
0.423 43 | IL1RAPL2 26280 broad.mit.edu 37 X 105011568 105011568 + Silent SNP C T T TCGA-06-5410-01A-01D-1696-08 TCGA-06-5410-10A-01D-1696-08 Somatic Phase_I Capture Illumina GAIIx 67244284-dc40-46cb-a2ac-3f4a38f7bbe4 2df41e20-041f-4e1e-86d9-3c38e36c9b33 g.chrX:105011568C>T uc004elz.1 + 11 2731 c.1975C>T c.(1975-1977)CTG>TTG p.L659L NM_017416 NP_059112 Q9NP60 IRPL2_HUMAN interleukin 1 receptor accessory protein-like 2 659 Cytoplasmic (Potential). central nervous system development|innate immune response integral to membrane interleukin-1, Type II, blocking receptor activity breast(2)|ovary(1) 3 TAATAACACCCTGAAAGATAC 0.448 44 | -------------------------------------------------------------------------------- /HW6/data/TCGA-28-5218.maf.txt: -------------------------------------------------------------------------------- 1 | Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_position End_position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_file Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID Genome_Change Annotation_Transcript Transcript_Strand Transcript_Exon Transcript_Position cDNA_Change Codon_Change Protein_Change Other_Transcripts Refseq_mRNA_Id Refseq_prot_Id SwissProt_acc_Id SwissProt_entry_Id Description UniProt_AApos UniProt_Region UniProt_Site UniProt_Natural_Variations UniProt_Experimental_Info GO_Biological_Process GO_Cellular_Component GO_Molecular_Function COSMIC_overlapping_mutations COSMIC_fusion_genes COSMIC_tissue_types_affected COSMIC_total_alterations_in_gene Tumorscape_Amplification_Peaks Tumorscape_Deletion_Peaks 
TCGAscape_Amplification_Peaks TCGAscape_Deletion_Peaks DrugBank ref_context gc_content CCLE_ONCOMAP_overlapping_mutations CCLE_ONCOMAP_total_mutations_in_gene CGC_Mutation_Type CGC_Translocation_Partner CGC_Tumor_Types_Somatic CGC_Tumor_Types_Germline CGC_Other_Diseases DNARepairGenes_Role FamilialCancerDatabase_Syndromes MUTSIG_Published_Results OREGANNO_ID OREGANNO_Values 2 | ZBTB40 9923 broad.mit.edu 37 1 22835047 22835047 + Missense_Mutation SNP G T T TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr1:22835047G>T uc001bft.2 + 9 2033 c.1522G>T c.(1522-1524)GAC>TAC p.D508Y ZBTB40_uc001bfu.2_Missense_Mutation_p.D508Y|ZBTB40_uc009vqi.1_Missense_Mutation_p.D396Y|ZBTB40_uc001bfv.1_Missense_Mutation_p.D137Y NM_001083621 NP_001077090 Q9NUA8 ZBT40_HUMAN zinc finger and BTB domain containing 40 508 bone mineralization|regulation of transcription, DNA-dependent|response to DNA damage stimulus|transcription, DNA-dependent nucleus DNA binding|zinc ion binding ovary(1) 1 Colorectal(325;3.46e-05)|Lung NSC(340;6.55e-05)|all_lung(284;9.87e-05)|Renal(390;0.000219)|Breast(348;0.00222)|Ovarian(437;0.00308)|Myeloproliferative disorder(586;0.0255) UCEC - Uterine corpus endometrioid carcinoma (279;0.0228)|OV - Ovarian serous cystadenocarcinoma(117;2.86e-26)|Colorectal(126;8.55e-08)|COAD - Colon adenocarcinoma(152;4.1e-06)|GBM - Glioblastoma multiforme(114;1.39e-05)|BRCA - Breast invasive adenocarcinoma(304;0.000712)|KIRC - Kidney renal clear cell carcinoma(1967;0.00374)|STAD - Stomach adenocarcinoma(196;0.00645)|READ - Rectum adenocarcinoma(331;0.0693)|Lung(427;0.216) TGTGAAACGTGACTCTGGTTC 0.483 3 | HMCN1 83872 broad.mit.edu 37 1 186121993 186121993 + Missense_Mutation SNP T G G TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 
g.chr1:186121993T>G uc001grq.1 + 96 15237 c.15008T>G c.(15007-15009)GTC>GGC p.V5003G HMCN1_uc001grs.1_Missense_Mutation_p.V572G NM_031935 NP_114141 Q96RW7 HMCN1_HUMAN hemicentin 1 precursor 5003 Nidogen G2 beta-barrel. response to stimulus|visual perception basement membrane calcium ion binding ovary(22)|skin(1) 23 CCTGCTGAAGTCACTGTAAAG 0.438 4 | OBSCN 84033 broad.mit.edu 37 1 228559651 228559651 + Missense_Mutation SNP C T T TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr1:228559651C>T uc009xez.1 + 94 21216 c.21172C>T c.(21172-21174)CCT>TCT p.P7058S OBSCN_uc001hsr.1_Missense_Mutation_p.P1687S NM_001098623 NP_001092093 Q5VST9 OBSCN_HUMAN obscurin, cytoskeletal calmodulin and 7058 Pro-rich. apoptosis|cell differentiation|induction of apoptosis by extracellular signals|multicellular organismal development|nerve growth factor receptor signaling pathway|regulation of Rho protein signal transduction|small GTPase mediated signal transduction cytosol|M band|Z disc ATP binding|metal ion binding|protein binding|protein serine/threonine kinase activity|protein tyrosine kinase activity|Rho guanyl-nucleotide exchange factor activity|structural constituent of muscle|titin binding stomach(8)|large_intestine(7)|breast(5)|ovary(4)|skin(2)|central_nervous_system(1)|pancreas(1) 28 Prostate(94;0.0405) CCCATGCCCTCCTGGCTCCTT 0.672 5 | KIAA1804 84451 broad.mit.edu 37 1 233518426 233518426 + Missense_Mutation SNP T C C TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr1:233518426T>C uc001hvt.3 + 10 3341 c.3080T>C c.(3079-3081)ATA>ACA p.I1027T KIAA1804_uc001hvu.3_Missense_Mutation_p.I473T NM_032435 NP_115811 Q5TCX8 M3KL4_HUMAN mixed lineage kinase 4 1027 activation of JUN kinase activity|protein autophosphorylation ATP binding|MAP kinase 
kinase kinase activity|protein homodimerization activity lung(5)|central_nervous_system(2)|skin(1) 8 all_cancers(173;0.000405)|all_epithelial(177;0.0345)|Prostate(94;0.122) CGGCCATCTATATATGAACTG 0.428 6 | HSD17B7P2 158160 broad.mit.edu 37 10 38654432 38654432 + Missense_Mutation SNP A G G rs2257765 TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr10:38654432A>G uc010qex.1 + 5 599 c.524A>G c.(523-525)AAT>AGT p.N175S HSD17B7P2_uc001izq.2_RNA|HSD17B7P2_uc001izo.1_RNA|HSD17B7P2_uc001izp.1_Missense_Mutation_p.N173S SubName: Full=cDNA FLJ60462, highly similar to 3-keto-steroid reductase (EC 1.1.1.270); 0 TCATCTCGCAATGCAAGGAAA 0.453 7 | PTPRE 5791 broad.mit.edu 37 10 129861345 129861345 + Splice_Site SNP A T T TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr10:129861345A>T uc001lkb.2 + 10 905 c.626_splice c.e10-2 p.G209_splice PTPRE_uc009yat.2_Splice_Site_p.G220_splice|PTPRE_uc010qup.1_Splice_Site|PTPRE_uc009yau.2_Splice_Site_p.G209_splice|PTPRE_uc001lkd.2_Splice_Site_p.G151_splice|PTPRE_uc010quq.1_Splice_Site_p.G110_splice NM_006504 NP_006495 P23469 PTPRE_HUMAN protein tyrosine phosphatase, receptor type, E negative regulation of insulin receptor signaling pathway|protein phosphorylation cytoplasm|integral to membrane|intermediate filament cytoskeleton|nucleus|plasma membrane transmembrane receptor protein tyrosine phosphatase activity ovary(1) 1 all_epithelial(44;1.66e-05)|all_lung(145;0.00456)|Lung NSC(174;0.0066)|all_neural(114;0.0936)|Colorectal(57;0.141)|Breast(234;0.166)|Melanoma(40;0.203) CTCTACACACAGGTCCCAAAC 0.522 8 | MEN1 4221 broad.mit.edu 37 11 64575521 64575521 + Nonsense_Mutation SNP G A A TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 
68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr11:64575521G>A uc001obj.2 - 3 584 c.511C>T c.(511-513)CAG>TAG p.Q171* MEN1_uc001obk.2_Nonsense_Mutation_p.Q171*|MEN1_uc001obl.2_Nonsense_Mutation_p.Q166*|MEN1_uc001obm.2_Nonsense_Mutation_p.Q166*|MEN1_uc001obn.2_Nonsense_Mutation_p.Q171*|MEN1_uc001obo.2_Nonsense_Mutation_p.Q171*|MEN1_uc001obp.2_Nonsense_Mutation_p.Q166*|MEN1_uc001obq.2_Nonsense_Mutation_p.Q171*|MEN1_uc001obr.2_Nonsense_Mutation_p.Q171* NM_130800 NP_570712 O00255 MEN1_HUMAN menin isoform 1 171 Missing (in MEN1). DNA repair|histone lysine methylation|MAPKKK cascade|negative regulation of cell proliferation|negative regulation of cyclin-dependent protein kinase activity|negative regulation of JNK cascade|negative regulation of osteoblast differentiation|negative regulation of protein phosphorylation|negative regulation of sequence-specific DNA binding transcription factor activity|negative regulation of telomerase activity|negative regulation of transcription from RNA polymerase II promoter|osteoblast development|positive regulation of protein binding|positive regulation of transforming growth factor beta receptor signaling pathway|response to gamma radiation|response to UV|transcription, DNA-dependent chromatin|cleavage furrow|cytosol|histone methyltransferase complex|nuclear matrix|soluble fraction double-stranded DNA binding|four-way junction DNA binding|protein binding, bridging|protein N-terminus binding|R-SMAD binding|transcription regulatory region DNA binding|Y-form DNA binding p.R171Q(1) parathyroid(105)|pancreas(64)|gastrointestinal_tract_(site_indeterminate)(15)|small_intestine(13)|lung(9)|pituitary(7)|NS(7)|adrenal_gland(5)|soft_tissue(4)|central_nervous_system(4)|thymus(2)|stomach(1)|retroperitoneum(1)|skin(1) 238 CCCAGGGCCTGGCAGGCCCCA 0.602 D|Mis|N|F|S parathyroid tumors|Pancreatic neuroendocrine tumors parathyroid adenoma|pituitary adenoma|pancreatic islet cell|carcinoid 
Hyperparathyroidism_Familial_Isolated|Multiple_Endocrine_Neoplasia_type_1 9 | KRTAP5-11 440051 broad.mit.edu 37 11 71293418 71293418 + Missense_Mutation SNP T G G TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr11:71293418T>G uc001oqu.2 - 1 504 c.466A>C c.(466-468)ATC>CTC p.I156L NM_001005405 NP_001005405 Q6L8G4 KR511_HUMAN keratin associated protein 5-11 156 keratin filament 0 GAGCCTCAGATCTTACACTGG 0.308 10 | INPPL1 3636 broad.mit.edu 37 11 71942586 71942586 + Frame_Shift_Del DEL C - - TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr11:71942586delC uc001osf.2 + 13 1689 c.1542delC c.(1540-1542)GTCfs p.V514fs INPPL1_uc001osg.2_Frame_Shift_Del_p.V272fs NM_001567 NP_001558 O15357 SHIP2_HUMAN inositol polyphosphate phosphatase-like 1 514 actin filament organization|cell adhesion|endocytosis actin cortical patch|cytosol actin binding|SH2 domain binding|SH3 domain binding skin(2)|ovary(1)|breast(1) 4 CAGTGCTGGTCAAGCCAGAGC 0.567 11 | RAB30 27314 broad.mit.edu 37 11 82693315 82693315 + Silent SNP G A A TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr11:82693315G>A uc001ozu.2 - 6 765 c.504C>T c.(502-504)TGC>TGT p.C168C RAB30_uc009yve.2_Silent_p.C166C|RAB30_uc010rst.1_Silent_p.C166C|RAB30_uc001ozv.2_3'UTR NM_014488 NP_055303 Q15771 RAB30_HUMAN RAB30, member RAS oncogene family 168 protein transport|small GTPase mediated signal transduction Golgi stack|plasma membrane GTP binding|GTPase activity 0 TGATGAGTCGGCATGCTAAGT 0.438 12 | SESN3 143686 broad.mit.edu 37 11 94924753 94924756 + Frame_Shift_Del DEL TTGC - - TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina 
GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr11:94924753_94924756delTTGC uc001pfk.1 - 3 376_379 c.154_157delGCAA c.(154-159)GCAAACfs p.A52fs SESN3_uc010rug.1_5'UTR|SESN3_uc001pfl.2_Frame_Shift_Del_p.A52fs NM_144665 NP_653266 P58005 SESN3_HUMAN sestrin 3 52_53 cell cycle arrest nucleus 0 Acute lymphoblastic leukemia(157;2.26e-05)|all_hematologic(158;0.0123) BRCA - Breast invasive adenocarcinoma(274;0.234) TCCACTGTGTTTGCTTGGACAACC 0.368 13 | HELB 92797 broad.mit.edu 37 12 66698566 66698566 + Silent SNP G A A TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr12:66698566G>A uc001sti.2 + 2 271 c.243G>A c.(241-243)CCG>CCA p.P81P HELB_uc010ssz.1_RNA|HELB_uc009zqt.1_RNA NM_033647 NP_387467 Q8NG08 HELB_HUMAN helicase (DNA) B 81 DNA replication, synthesis of RNA primer ATP binding|ATP-dependent 5'-3' DNA helicase activity|single-stranded DNA-dependent ATP-dependent DNA helicase activity central_nervous_system(1)|pancreas(1) 2 GBM - Glioblastoma multiforme(2;0.000142) GBM - Glioblastoma multiforme(28;0.0265) GACGTTTTCCGATAACAGGTG 0.378 14 | CABP1 9478 broad.mit.edu 37 12 121098105 121098105 + Missense_Mutation SNP G A A TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr12:121098105G>A uc001tyu.2 + 3 859 c.792G>A c.(790-792)ATG>ATA p.M264I CABP1_uc001tyv.2_Missense_Mutation_p.M121I|CABP1_uc001tyw.2_Missense_Mutation_p.M61I|CABP1_uc001tyx.2_Missense_Mutation_p.M106I NM_001033677 NP_001028849 Q9NZU7 CABP1_HUMAN calcium binding protein 1 isoform 3 264 EF-hand 2. 
cell cortex|cell junction|Golgi apparatus|perinuclear region of cytoplasm|postsynaptic density|postsynaptic membrane calcium ion binding|calcium-dependent protein binding|enzyme inhibitor activity|protein binding central_nervous_system(1) 1 all_neural(191;0.0684)|Medulloblastoma(191;0.0922) CCACCGAGATGGAGCTCATCG 0.542 15 | HERC2 8924 broad.mit.edu 37 15 28389261 28389261 + Missense_Mutation SNP G A A TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr15:28389261G>A uc001zbj.2 - 73 11367 c.11261C>T c.(11260-11262)GCG>GTG p.A3754V NM_004667 NP_004658 O95714 HERC2_HUMAN hect domain and RLD 2 3754 DNA repair|intracellular protein transport|protein ubiquitination involved in ubiquitin-dependent protein catabolic process nucleus guanyl-nucleotide exchange factor activity|heme binding|protein binding|ubiquitin-protein ligase activity|zinc ion binding ovary(4)|lung(4)|skin(3)|upper_aerodigestive_tract(1)|central_nervous_system(1) 13 all_lung(180;1.3e-11)|Breast(32;0.000194)|Colorectal(260;0.227) all cancers(64;3.93e-09)|Epithelial(43;9.99e-08)|BRCA - Breast invasive adenocarcinoma(123;0.0271)|GBM - Glioblastoma multiforme(186;0.0497)|Lung(196;0.199) CAGCGAGGCCGCAAGGCGAGG 0.537 16 | WASH3P 374666 broad.mit.edu 37 15 102515344 102515344 + Missense_Mutation SNP A C C rs141089280 by1000genomes TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr15:102515344A>C uc002cdi.2 + 9 1988 c.568A>C c.(568-570)AAG>CAG p.K190Q WASH3P_uc002cdl.2_Missense_Mutation_p.K190Q|WASH3P_uc002cdk.2_RNA|WASH3P_uc002cdp.2_Missense_Mutation_p.K190Q|WASH3P_uc010bpo.2_RNA|WASH3P_uc002cdq.2_RNA|WASH3P_uc002cdr.2_RNA NR_003659 RecName: Full=WAS protein family homolog 2; AltName: Full=Protein FAM39B; AltName: Full=CXYorf1-like protein on chromosome 2; 0 
GCTGGAGAAGAAGCAGCAGAA 0.662 17 | MRPS34 65993 broad.mit.edu 37 16 1823074 1823075 + Frame_Shift_Ins INS - G G TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr16:1823074_1823075insG uc002cmo.2 - 1 66_67 c.46_47insC c.(46-48)CGCfs p.R16fs NME3_uc002cmm.2_5'Flank|NME3_uc010brv.2_5'Flank|MRPS34_uc002cmn.2_5'Flank|MRPS34_uc002cmp.1_Frame_Shift_Ins_p.R16fs|EME2_uc002cmq.1_5'Flank|EME2_uc010brw.1_5'Flank NM_023936 NP_076425 P82930 RT34_HUMAN mitochondrial ribosomal protein S34 16 mitochondrion|ribosome protein binding skin(2) 2 GCGCACGCGGCGGGCCAGCTCC 0.723 18 | RNF40 9810 broad.mit.edu 37 16 30774843 30774843 + Silent SNP G A A TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr16:30774843G>A uc002dzq.2 + 4 528 c.405G>A c.(403-405)GGG>GGA p.G135G C16orf93_uc002dzm.2_5'Flank|C16orf93_uc002dzn.2_5'Flank|C16orf93_uc002dzo.2_5'Flank|C16orf93_uc002dzp.2_5'Flank|RNF40_uc010caa.2_Silent_p.G135G|RNF40_uc010cab.2_Silent_p.G135G|RNF40_uc010vfa.1_Intron|RNF40_uc002dzr.2_Silent_p.G135G|RNF40_uc010vfb.1_Intron NM_014771 NP_055586 O75150 BRE1B_HUMAN ring finger protein 40 135 histone H2B ubiquitination|histone monoubiquitination|ubiquitin-dependent protein catabolic process nucleus|synaptosome|ubiquitin ligase complex protein homodimerization activity|ubiquitin protein ligase binding|zinc ion binding central_nervous_system(1) 1 Colorectal(24;0.198) CATGTGATGGGACTCCTCTCC 0.612 19 | KRTAP1-1 81851 broad.mit.edu 37 17 39197186 39197186 + Missense_Mutation SNP C T T TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr17:39197186C>T uc002hvw.1 - 1 528 c.464G>A c.(463-465)CGC>CAC p.R155H NM_030967 NP_112229 Q07627 
KRA11_HUMAN keratin associated protein 1-1 155 extracellular region|keratin filament 0 Breast(137;0.000496) STAD - Stomach adenocarcinoma(17;0.000371) GTAGGATGGGCGGCAGCAGGA 0.637 20 | C19orf10 56005 broad.mit.edu 37 19 4668644 4668644 + Missense_Mutation SNP C T T TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr19:4668644C>T uc002may.2 - 2 257 c.188G>A c.(187-189)TGT>TAT p.C63Y NM_019107 NP_061980 Q969H8 CS010_HUMAN hypothetical protein LOC56005 precursor 63 ER-Golgi intermediate compartment|extracellular region 0 Hepatocellular(1079;0.137) UCEC - Uterine corpus endometrioid carcinoma (162;6.64e-05)|BRCA - Breast invasive adenocarcinoma(158;0.015) AGTGAACATACACGTATATTT 0.313 21 | ZNF317 57693 broad.mit.edu 37 19 9267420 9267420 + Missense_Mutation SNP C T T TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr19:9267420C>T uc002mku.2 + 3 433 c.158C>T c.(157-159)TCC>TTC p.S53F ZNF317_uc010xkm.1_Silent_p.F94F|ZNF317_uc002mkv.2_5'UTR|ZNF317_uc002mkw.2_Missense_Mutation_p.S53F|ZNF317_uc002mkx.2_5'UTR|ZNF317_uc002mky.2_5'UTR NM_020933 NP_065984 Q96PQ6 ZN317_HUMAN zinc finger protein 317 53 regulation of transcription, DNA-dependent|transcription, DNA-dependent nucleus DNA binding|zinc ion binding 0 AGTGTTGGTTCCCAGGTGCAC 0.527 22 | MAN2B1 4125 broad.mit.edu 37 19 12763065 12763065 + Missense_Mutation SNP C G G TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr19:12763065C>G uc002mub.2 - 16 2024 c.1948G>C c.(1948-1950)GAC>CAC p.D650H MAN2B1_uc010dyv.1_Missense_Mutation_p.D649H NM_000528 NP_000519 O00754 MA2B1_HUMAN mannosidase, alpha, class 2B, member 1 650 protein deglycosylation lysosome 
alpha-mannosidase activity|zinc ion binding ovary(4)|central_nervous_system(2) 6 CTTTCGTTGTCACCTATACTG 0.597 23 | TMEM147 10430 broad.mit.edu 37 19 36037641 36037641 + Missense_Mutation SNP C T T TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr19:36037641C>T uc002oaj.1 + 4 372 c.275C>T c.(274-276)GCC>GTC p.A92V uc010eec.1_5'Flank|uc002oag.2_5'Flank|TMEM147_uc002oai.1_Missense_Mutation_p.A43V|TMEM147_uc002oak.1_Missense_Mutation_p.P2S NM_032635 NP_116024 Q9BVK8 TM147_HUMAN transmembrane protein 147 92 endoplasmic reticulum membrane|integral to membrane protein binding 0 all_lung(56;1.05e-07)|Lung NSC(56;1.63e-07)|Esophageal squamous(110;0.162) LUSC - Lung squamous cell carcinoma(66;0.0724) TCCCGGAATGCCGGCAAGGGA 0.572 24 | EXOSC5 56915 broad.mit.edu 37 19 41895788 41895788 + Missense_Mutation SNP G A A TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr19:41895788G>A uc002oqo.2 - 4 430 c.407C>T c.(406-408)GCC>GTC p.A136V CYP2F1_uc010xvw.1_Intron|BCKDHA_uc002oqm.3_Intron NM_020158 NP_064543 Q9NQT4 EXOS5_HUMAN exosome component Rrp46 136 DNA deamination|exonucleolytic nuclear-transcribed mRNA catabolic process involved in deadenylation-dependent decay|rRNA processing cytosol|exosome (RNase complex)|nucleolus|transcriptionally active chromatin 3'-5'-exoribonuclease activity|protein binding|RNA binding 0 CATGCAGGCGGCATTCAGACA 0.448 25 | NLRP5 126206 broad.mit.edu 37 19 56539217 56539217 + Missense_Mutation SNP T A A TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr19:56539217T>A uc002qmj.2 + 7 1618 c.1618T>A c.(1618-1620)TGG>AGG p.W540R NLRP5_uc002qmi.2_Missense_Mutation_p.W521R NM_153447 NP_703148 
P59047 NALP5_HUMAN NACHT, LRR and PYD containing protein 5 540 NACHT. mitochondrion|nucleolus ATP binding ovary(3)|skin(2)|kidney(1)|central_nervous_system(1) 7 Colorectal(82;3.46e-05)|Ovarian(87;0.0481)|Renal(1328;0.157) GBM - Glioblastoma multiforme(193;0.0326) GGAGGGAGTGTGGAATAGGAA 0.552 26 | FIGN 55137 broad.mit.edu 37 2 164467616 164467616 + Silent SNP G A A TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr2:164467616G>A uc002uck.1 - 3 1037 c.726C>T c.(724-726)CTC>CTT p.L242L NM_018086 NP_060556 Q5HY92 FIGN_HUMAN fidgetin 242 Pro-rich. nuclear matrix ATP binding|nucleoside-triphosphatase activity large_intestine(2)|ovary(1)|skin(1) 4 TGTAACTGGAGAGGTTAGAAG 0.612 27 | SNRPB 6628 broad.mit.edu 37 20 2443779 2443779 + Missense_Mutation SNP C T T TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr20:2443779C>T uc002wfz.1 - 5 678 c.515G>A c.(514-516)CGT>CAT p.R172H SNRPB_uc002wga.1_Missense_Mutation_p.R172H|SNRPB_uc010zpv.1_Missense_Mutation_p.R93H|SNRPB_uc002wgb.2_Missense_Mutation_p.R172H|SNORD119_uc010gam.1_5'Flank NM_198216 NP_937859 P14678 RSMB_HUMAN small nuclear ribonucleoprotein polypeptide B/B' 172 RG -> L (in Ref. 4). 
histone mRNA metabolic process|ncRNA metabolic process|spliceosomal snRNP assembly|termination of RNA polymerase II transcription catalytic step 2 spliceosome|cytosol|nucleoplasm|U12-type spliceosomal complex|U7 snRNP protein binding|protein binding|RNA binding ovary(1) 1 AGGACCCCCACGGCCAGGTGG 0.597 28 | SENP5 205564 broad.mit.edu 37 3 196613120 196613120 + Nonsense_Mutation SNP G A A TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr3:196613120G>A uc003fwz.3 + 2 1317 c.1068G>A c.(1066-1068)TGG>TGA p.W356* SENP5_uc011bty.1_Nonsense_Mutation_p.W356* NM_152699 NP_689912 Q96HI0 SENP5_HUMAN SUMO1/sentrin specific peptidase 5 356 cell cycle|cell division|proteolysis nucleolus cysteine-type peptidase activity breast(2)|lung(1) 3 all_cancers(143;1.8e-08)|Ovarian(172;0.0634)|Breast(254;0.135) Epithelial(36;3.14e-24)|all cancers(36;2.1e-22)|OV - Ovarian serous cystadenocarcinoma(49;1.03e-18)|LUSC - Lung squamous cell carcinoma(58;1.51e-06)|Lung(62;1.95e-06) GBM - Glioblastoma multiforme(46;0.004) CAAACGCCTGGGACCAGTCAT 0.468 29 | OR2J2 26707 broad.mit.edu 37 6 29142195 29142195 + Silent SNP C G G TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr6:29142195C>G uc011dlm.1 + 1 885 c.783C>G c.(781-783)CTC>CTG p.L261L NM_030905 NP_112167 O76002 OR2J2_HUMAN olfactory receptor, family 2, subfamily J, 261 Extracellular (Potential). 
sensory perception of smell integral to membrane|plasma membrane olfactory receptor activity 0 GCATGTATCTCCAGCCACCAT 0.433 30 | MUC17 140453 broad.mit.edu 37 7 100677921 100677921 + Missense_Mutation SNP C T T TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr7:100677921C>T uc003uxp.1 + 3 3277 c.3224C>T c.(3223-3225)ACT>ATT p.T1075I MUC17_uc010lho.1_RNA NM_001040105 NP_001035194 Q685J3 MUC17_HUMAN mucin 17 precursor 1075 Extracellular (Potential).|Ser-rich.|59 X approximate tandem repeats.|16. extracellular region|integral to membrane|plasma membrane extracellular matrix constituent, lubricant activity ovary(14)|skin(8)|breast(3)|lung(2) 27 Lung NSC(181;0.136)|all_lung(186;0.182) CCTGTGACCACTTATTCTCAA 0.488 31 | EPHA1 2041 broad.mit.edu 37 7 143098437 143098437 + Nonsense_Mutation SNP G A A TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr7:143098437G>A uc003wcz.2 - 3 499 c.412C>T c.(412-414)CGA>TGA p.R138* NM_005232 NP_005223 P21709 EPHA1_HUMAN ephrin receptor EphA1 precursor 138 Extracellular (Potential). 
integral to plasma membrane ATP binding|ephrin receptor activity ovary(3)|lung(1)|breast(1) 5 Melanoma(164;0.205) Myeloproliferative disorder(862;0.0255) AAGGGCCGTCGGAGCTGAATG 0.592 32 | ATP6V1C1 528 broad.mit.edu 37 8 104075258 104075258 + Missense_Mutation SNP C G G TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr8:104075258C>G uc003ykz.3 + 9 962 c.717C>G c.(715-717)CAC>CAG p.H239Q ATP6V1C1_uc010mbz.2_Missense_Mutation_p.H164Q|ATP6V1C1_uc003yla.2_Missense_Mutation_p.H239Q|ATP6V1C1_uc011lhl.1_Missense_Mutation_p.H164Q NM_001695 NP_001686 P21283 VATC1_HUMAN ATPase, H+ transporting, lysosomal V1 subunit 239 ATP hydrolysis coupled proton transport|cellular iron ion homeostasis|insulin receptor signaling pathway|transferrin transport cytosol|plasma membrane|proton-transporting V-type ATPase, V1 domain protein binding|proton-transporting ATPase activity, rotational mechanism 0 Lung NSC(17;0.000427)|all_lung(17;0.000533) OV - Ovarian serous cystadenocarcinoma(57;3.57e-05)|STAD - Stomach adenocarcinoma(118;0.133) ACTTCAGACACAAAGCCAGAG 0.328 33 | LRRC6 23639 broad.mit.edu 37 8 133645122 133645122 + Missense_Mutation SNP C T T TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr8:133645122C>T uc003ytk.2 - 5 591 c.517G>A c.(517-519)GAA>AAA p.E173K LRRC6_uc003ytl.2_RNA NM_012472 NP_036604 Q86X45 LRRC6_HUMAN leucine rich repeat containing 6 173 cytoplasm ovary(1)|kidney(1) 2 Ovarian(258;0.00352)|Esophageal squamous(12;0.00507)|all_neural(3;0.0052)|Medulloblastoma(3;0.0922)|Acute lymphoblastic leukemia(118;0.155) BRCA - Breast invasive adenocarcinoma(115;0.000311) TGATCTTTTTCCTGCTCTCTG 0.398 34 | CDKN2B 1030 broad.mit.edu 37 9 22006044 22006044 + Missense_Mutation SNP G T T TCGA-28-5218-01A-01D-1486-08 
TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr9:22006044G>T uc003zpo.2 - 2 719 c.359C>A c.(358-360)GCC>GAC p.A120D MTAP_uc003zpi.1_Intron|CDKN2BAS_uc010miw.1_Intron|CDKN2BAS_uc010mix.1_Intron|CDKN2BAS_uc003zpm.2_Intron|CDKN2B_uc003zpn.2_3'UTR NM_004936 NP_004927 P42772 CDN2B_HUMAN cyclin-dependent kinase inhibitor 2B isoform 1 120 ANK 4. cell cycle arrest|cellular response to nutrient|G1 phase of mitotic cell cycle|G2/M transition of mitotic cell cycle|megakaryocyte differentiation|mitotic cell cycle G1/S transition checkpoint|negative regulation of epithelial cell proliferation|positive regulation of transforming growth factor beta receptor signaling pathway|regulation of cyclin-dependent protein kinase activity cytosol|nucleus cyclin-dependent protein kinase inhibitor activity|protein kinase binding lung(1) 1 all_cancers(5;0)|Acute lymphoblastic leukemia(3;0)|all_hematologic(3;0)|all_epithelial(2;1.31e-280)|Lung NSC(2;2.28e-131)|all_lung(2;2.11e-123)|Glioma(2;5.66e-57)|all_neural(2;3.05e-50)|Renal(3;1.07e-46)|Esophageal squamous(3;3.83e-46)|Melanoma(2;8.01e-33)|Breast(3;1.14e-11)|Ovarian(3;0.000128)|Hepatocellular(5;0.00369)|Colorectal(97;0.172) all cancers(2;0)|GBM - Glioblastoma multiforme(3;0)|Lung(2;3.29e-71)|Epithelial(2;9.08e-60)|LUSC - Lung squamous cell carcinoma(2;5.8e-46)|LUAD - Lung adenocarcinoma(2;1.43e-25)|BRCA - Breast invasive adenocarcinoma(2;5.37e-09)|STAD - Stomach adenocarcinoma(4;4.63e-07)|Kidney(2;6.92e-07)|KIRC - Kidney renal clear cell carcinoma(2;8.63e-07)|OV - Ovarian serous cystadenocarcinoma(39;0.014)|COAD - Colon adenocarcinoma(8;0.143) CCGCTCCTCGGCCAAGTCCAC 0.701 Familial_Malignant_Melanoma_and_Tumors_of_the_Nervous_System 35 | CDKN2B 1030 broad.mit.edu 37 9 22006068 22006068 + Missense_Mutation SNP C A A TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 
68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chr9:22006068C>A uc003zpo.2 - 2 695 c.335G>T c.(334-336)TGG>TTG p.W112L MTAP_uc003zpi.1_Intron|CDKN2BAS_uc010miw.1_Intron|CDKN2BAS_uc010mix.1_Intron|CDKN2BAS_uc003zpm.2_Intron|CDKN2B_uc003zpn.2_3'UTR NM_004936 NP_004927 P42772 CDN2B_HUMAN cyclin-dependent kinase inhibitor 2B isoform 1 112 ANK 4. cell cycle arrest|cellular response to nutrient|G1 phase of mitotic cell cycle|G2/M transition of mitotic cell cycle|megakaryocyte differentiation|mitotic cell cycle G1/S transition checkpoint|negative regulation of epithelial cell proliferation|positive regulation of transforming growth factor beta receptor signaling pathway|regulation of cyclin-dependent protein kinase activity cytosol|nucleus cyclin-dependent protein kinase inhibitor activity|protein kinase binding lung(1) 1 all_cancers(5;0)|Acute lymphoblastic leukemia(3;0)|all_hematologic(3;0)|all_epithelial(2;1.31e-280)|Lung NSC(2;2.28e-131)|all_lung(2;2.11e-123)|Glioma(2;5.66e-57)|all_neural(2;3.05e-50)|Renal(3;1.07e-46)|Esophageal squamous(3;3.83e-46)|Melanoma(2;8.01e-33)|Breast(3;1.14e-11)|Ovarian(3;0.000128)|Hepatocellular(5;0.00369)|Colorectal(97;0.172) all cancers(2;0)|GBM - Glioblastoma multiforme(3;0)|Lung(2;3.29e-71)|Epithelial(2;9.08e-60)|LUSC - Lung squamous cell carcinoma(2;5.8e-46)|LUAD - Lung adenocarcinoma(2;1.43e-25)|BRCA - Breast invasive adenocarcinoma(2;5.37e-09)|STAD - Stomach adenocarcinoma(4;4.63e-07)|Kidney(2;6.92e-07)|KIRC - Kidney renal clear cell carcinoma(2;8.63e-07)|OV - Ovarian serous cystadenocarcinoma(39;0.014)|COAD - Colon adenocarcinoma(8;0.143) CAGACGACCCCAGGCATCGCG 0.726 Familial_Malignant_Melanoma_and_Tumors_of_the_Nervous_System 36 | OTC 5009 broad.mit.edu 37 X 38260629 38260629 + Missense_Mutation SNP T C C TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chrX:38260629T>C uc004def.3 + 
5 702 c.488T>C c.(487-489)CTG>CCG p.L163P NM_000531 NP_000522 P00480 OTC_HUMAN ornithine carbamoyltransferase precursor 163 arginine biosynthetic process|urea cycle mitochondrial matrix|ornithine carbamoyltransferase complex ornithine carbamoyltransferase activity ovary(1)|breast(1) 2 L-Citrulline(DB00155)|L-Ornithine(DB00129) ATCAATGGGCTGTCAGATTTG 0.408 37 | HUWE1 10075 broad.mit.edu 37 X 53569470 53569470 + Missense_Mutation SNP G A A TCGA-28-5218-01A-01D-1486-08 TCGA-28-5218-10A-01D-1486-08 Somatic Phase_I Capture Illumina GAIIx 68008a98-3889-4dd2-bcf9-f1f6cbca6355 727e8e46-718d-4e44-96a1-ed3544500a07 g.chrX:53569470G>A uc004dsp.2 - 74 11812 c.11410C>T c.(11410-11412)CGG>TGG p.R3804W HUWE1_uc004dsn.2_Missense_Mutation_p.R2612W|HUWE1_uc004dsq.1_Missense_Mutation_p.R104W NM_031407 NP_113584 Q7Z6Z7 HUWE1_HUMAN HECT, UBA and WWE domain containing 1 3804 base-excision repair|cell differentiation|histone ubiquitination|protein monoubiquitination|protein polyubiquitination|protein ubiquitination involved in ubiquitin-dependent protein catabolic process cytoplasm|nucleus DNA binding|protein binding|ubiquitin-protein ligase activity ovary(8)|large_intestine(4)|breast(4)|kidney(1) 17 TCCTCCCTCCGGACAGACGCC 0.502 38 | -------------------------------------------------------------------------------- /HW6/data/TCGA-32-4209.maf.txt: -------------------------------------------------------------------------------- 1 | Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_position End_position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_file Sequencer Tumor_Sample_UUID 
Matched_Norm_Sample_UUID Genome_Change Annotation_Transcript Transcript_Strand Transcript_Exon Transcript_Position cDNA_Change Codon_Change Protein_Change Other_Transcripts Refseq_mRNA_Id Refseq_prot_Id SwissProt_acc_Id SwissProt_entry_Id Description UniProt_AApos UniProt_Region UniProt_Site UniProt_Natural_Variations UniProt_Experimental_Info GO_Biological_Process GO_Cellular_Component GO_Molecular_Function COSMIC_overlapping_mutations COSMIC_fusion_genes COSMIC_tissue_types_affected COSMIC_total_alterations_in_gene Tumorscape_Amplification_Peaks Tumorscape_Deletion_Peaks TCGAscape_Amplification_Peaks TCGAscape_Deletion_Peaks DrugBank ref_context gc_content CCLE_ONCOMAP_overlapping_mutations CCLE_ONCOMAP_total_mutations_in_gene CGC_Mutation_Type CGC_Translocation_Partner CGC_Tumor_Types_Somatic CGC_Tumor_Types_Germline CGC_Other_Diseases DNARepairGenes_Role FamilialCancerDatabase_Syndromes MUTSIG_Published_Results OREGANNO_ID OREGANNO_Values 2 | DNAJC11 55735 broad.mit.edu 37 1 6727822 6727822 + Missense_Mutation SNP G A A TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr1:6727822G>A uc001aof.2 - 4 431 c.325C>T c.(325-327)CGG>TGG p.R109W DNAJC11_uc010nzt.1_Missense_Mutation_p.R71W|DNAJC11_uc001aog.2_Missense_Mutation_p.R109W|DNAJC11_uc010nzu.1_Missense_Mutation_p.R19W NM_018198 NP_060668 Q9NVH1 DJC11_HUMAN DnaJ (Hsp40) homolog, subfamily C, member 11 109 protein folding heat shock protein binding|unfolded protein binding ovary(1)|skin(1) 2 Ovarian(185;0.0265)|all_lung(157;0.154) all_cancers(23;1.97e-27)|all_epithelial(116;1.76e-17)|all_lung(118;2.27e-05)|Lung NSC(185;9.97e-05)|Renal(390;0.00188)|Breast(487;0.00289)|Colorectal(325;0.00342)|Hepatocellular(190;0.0218)|Myeloproliferative disorder(586;0.0393)|Ovarian(437;0.156) Colorectal(212;2.34e-07)|COAD - Colon adenocarcinoma(227;2.05e-05)|Kidney(185;7.67e-05)|BRCA - Breast invasive 
adenocarcinoma(304;0.000639)|KIRC - Kidney renal clear cell carcinoma(229;0.00128)|STAD - Stomach adenocarcinoma(132;0.00179)|READ - Rectum adenocarcinoma(331;0.0649) CTCTGCAGCCGCTCAAACTCC 0.522 3 | MST1P9 11223 broad.mit.edu 37 1 17085479 17085479 + Silent SNP T C C TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr1:17085479T>C uc010ock.1 - 10 1212 c.1212A>G c.(1210-1212)AAA>AAG p.K404K CROCC_uc009voy.1_Intron|MST1P9_uc001azp.3_5'UTR NR_002729 SubName: Full=Hepatocyte growth factor-like protein homolog; 0 GTCTCAACCATTTCCAGGCTC 0.617 4 | LPAR3 23566 broad.mit.edu 37 1 85331664 85331665 + Frame_Shift_Ins INS - A A rs76299065 TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr1:85331664_85331665insA uc001dkl.2 - 1 178_179 c.139_140insT c.(139-141)TCTfs p.S47fs LPAR3_uc009wcj.1_Frame_Shift_Ins_p.S47fs NM_012152 NP_036284 Q9UBY5 LPAR3_HUMAN lysophosphatidic acid receptor 3 47 Helical; Name=1; (Potential). 
G-protein signaling, coupled to cyclic nucleotide second messenger|synaptic transmission integral to plasma membrane|intracellular membrane-bounded organelle lung(3)|ovary(2) 5 CAGAGAATTAGAAAAAAAAATA 0.401 5 | NBPF10 100132406 broad.mit.edu 37 1 145324371 145324371 + Missense_Mutation SNP T C C TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr1:145324371T>C uc001end.3 + 30 3826 c.3791T>C c.(3790-3792)GTA>GCA p.V1264A NBPF10_uc009wir.2_Intron|NBPF9_uc010oye.1_Intron|NBPF10_uc001emp.3_Intron|NBPF10_uc010oyi.1_Intron|NBPF10_uc010oyk.1_Intron|NBPF10_uc010oyl.1_Intron|NBPF10_uc001enc.2_Intron|NBPF10_uc010oym.1_Intron|NBPF10_uc010oyn.1_Intron|NBPF10_uc010oyo.1_Intron|NBPF10_uc010oyp.1_RNA NM_001039703 NP_001034792 A6NDV3 A6NDV3_HUMAN hypothetical protein LOC100132406 1189 0 all_hematologic(923;0.032) Colorectal(1306;1.36e-07)|KIRC - Kidney renal clear cell carcinoma(1967;0.00258) CTGCTGGAGGTAGTAGCGCCT 0.498 6 | LOC645166 645166 broad.mit.edu 37 1 148933289 148933289 + Splice_Site SNP A G G rs9729175 by1000genomes TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr1:148933289A>G uc010pbc.1 + 3 c.236_splice c.e3-2 LOC645166_uc010pbd.1_Intron|LOC645166_uc009wkw.1_Splice_Site NR_027355 Homo sapiens cDNA, FLJ18771. 
0 TGCTGCCCGCAGGATATTGTG 0.562 7 | TDRKH 11022 broad.mit.edu 37 1 151755433 151755433 + Silent SNP C T T TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr1:151755433C>T uc009wnb.1 - 2 248 c.66G>A c.(64-66)GGG>GGA p.G22G TDRKH_uc001eyy.2_5'UTR|TDRKH_uc001ezb.3_Silent_p.G22G|TDRKH_uc001ezc.3_Silent_p.G22G|TDRKH_uc001eza.3_Silent_p.G22G|TDRKH_uc001ezd.3_Silent_p.G22G|TDRKH_uc010pdn.1_5'UTR NM_006862 NP_006853 Q9Y2W6 TDRKH_HUMAN tudor and KH domain containing isoform a 22 RNA binding p.G22V(1) ovary(1)|pancreas(1) 2 Hepatocellular(266;0.0877)|all_hematologic(923;0.127)|Melanoma(130;0.14) LUSC - Lung squamous cell carcinoma(543;0.181) TGGCTGGGATCCCAAGGCCCA 0.463 8 | CRTC2 200186 broad.mit.edu 37 1 153921628 153921628 + Missense_Mutation SNP G A A TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr1:153921628G>A uc010ped.1 - 12 1707 c.1637C>T c.(1636-1638)TCT>TTT p.S546F DENND4B_uc001fdd.1_5'Flank|CRTC2_uc001fde.3_RNA|CRTC2_uc001fdf.3_Missense_Mutation_p.S82F NM_181715 NP_859066 Q53ET0 CRTC2_HUMAN CREB regulated transcription coactivator 2 546 interspecies interaction between organisms|regulation of transcription, DNA-dependent|transcription, DNA-dependent cytoplasm|nucleus protein binding ovary(2) 2 all_lung(78;3.05e-32)|Lung NSC(65;3.74e-30)|Hepatocellular(266;0.0877)|Melanoma(130;0.199) LUSC - Lung squamous cell carcinoma(543;0.151) CCGGTGGTAAGACTGTTGCCC 0.597 9 | OR10J3 441911 broad.mit.edu 37 1 159283999 159283999 + Missense_Mutation SNP C T T TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr1:159283999C>T uc010piu.1 - 1 451 c.451G>A c.(451-453)GGG>AGG p.G151R NM_001004467 NP_001004467 Q5JRS4 
O10J3_HUMAN olfactory receptor, family 10, subfamily J, 151 Helical; Name=4; (Potential). sensory perception of smell integral to membrane|plasma membrane olfactory receptor activity ovary(2) 2 all_hematologic(112;0.0429) AGGCCAATCCCCAGTGATCCA 0.507 10 | POU2F1 5451 broad.mit.edu 37 1 167358969 167358969 + Missense_Mutation SNP C G G TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr1:167358969C>G uc001gec.2 + 10 1051 c.889C>G c.(889-891)CAA>GAA p.Q297E POU2F1_uc010plg.1_RNA|POU2F1_uc001ged.2_Missense_Mutation_p.Q295E|POU2F1_uc001gee.2_Missense_Mutation_p.Q297E|POU2F1_uc010plh.1_Missense_Mutation_p.Q234E|POU2F1_uc001gef.2_Missense_Mutation_p.Q309E|POU2F1_uc001geg.2_Missense_Mutation_p.Q195E NM_002697 NP_002688 P14859 PO2F1_HUMAN POU class 2 homeobox 1 297 POU-specific. negative regulation of transcription, DNA-dependent|transcription from RNA polymerase III promoter nucleoplasm protein binding|sequence-specific DNA binding|sequence-specific DNA binding transcription factor activity central_nervous_system(2)|skin(2)|breast(1) 5 GACCTTCAAACAAAGACGAAT 0.438 11 | C1orf26 54823 broad.mit.edu 37 1 185143825 185143825 + Missense_Mutation SNP G C C rs146489629 byFrequency TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr1:185143825G>C uc001grg.3 + 5 660 c.546G>C c.(544-546)AAG>AAC p.K182N C1orf26_uc001grh.3_Missense_Mutation_p.K182N NM_001105518 NP_001098988 Q5T5J6 SWT1_HUMAN hypothetical protein LOC54823 182 0 AGAGAGAGAAGATGAAAGAAC 0.353 12 | CFH 3075 broad.mit.edu 37 1 196694295 196694295 + Missense_Mutation SNP G A A TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr1:196694295G>A uc001gtj.3 + 12 
1981 c.1741G>A c.(1741-1743)GAT>AAT p.D581N NM_000186 NP_000177 P08603 CFAH_HUMAN complement factor H isoform a precursor 581 Sushi 10. complement activation, alternative pathway extracellular space skin(4)|ovary(1)|breast(1) 6 CTTAGTTCCTGATCGCAAGAA 0.343 13 | TLL2 7093 broad.mit.edu 37 10 98155658 98155658 + Missense_Mutation SNP C T T TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr10:98155658C>T uc001kml.1 - 12 1730 c.1504G>A c.(1504-1506)GTG>ATG p.V502M TLL2_uc009xvf.1_Missense_Mutation_p.V480M NM_012465 NP_036597 Q9Y6L7 TLL2_HUMAN tolloid-like 2 precursor 502 CUB 2. cell differentiation|multicellular organismal development|proteolysis extracellular region calcium ion binding|metalloendopeptidase activity|zinc ion binding ovary(1)|pancreas(1)|skin(1) 3 Colorectal(252;0.0846) Epithelial(162;1.51e-07)|all cancers(201;7.59e-06) GTAAGTCCCACGTGAAACCCC 0.498 OREG0020398 type=REGULATORY REGION|TFbs=CTCF|Dataset=CTCF ChIP-chip sites (Ren lab)|EvidenceSubtype=ChIP-on-chip (ChIP-chip) 14 | CHUK 1147 broad.mit.edu 37 10 101960490 101960490 + Silent SNP A G G TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr10:101960490A>G uc001kqp.2 - 15 1672 c.1617T>C c.(1615-1617)GCT>GCC p.A539A NM_001278 NP_001269 O15111 IKKA_HUMAN conserved helix-loop-helix ubiquitous kinase 539 I-kappaB phosphorylation|innate immune response|MyD88-dependent toll-like receptor signaling pathway|MyD88-independent toll-like receptor signaling pathway|nerve growth factor receptor signaling pathway|phosphatidylinositol-mediated signaling|positive regulation of I-kappaB kinase/NF-kappaB cascade|positive regulation of NF-kappaB transcription factor activity|T cell receptor signaling pathway|Toll signaling pathway|toll-like receptor 1 signaling 
pathway|toll-like receptor 2 signaling pathway|toll-like receptor 3 signaling pathway|toll-like receptor 4 signaling pathway CD40 receptor complex|cytosol|internal side of plasma membrane|nucleus ATP binding|identical protein binding|IkappaB kinase activity ovary(2)|central_nervous_system(2)|large_intestine(1)|lung(1)|breast(1) 7 Colorectal(252;0.117) Epithelial(162;2.05e-10)|all cancers(201;1.91e-08) CCATGATTTCAGCATGCAAAG 0.413 15 | MYO7A 4647 broad.mit.edu 37 11 76901767 76901767 + Missense_Mutation SNP T C C TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr11:76901767T>C uc001oyb.2 + 30 4048 c.3776T>C c.(3775-3777)ATG>ACG p.M1259T MYO7A_uc010rsm.1_Missense_Mutation_p.M1248T|MYO7A_uc001oyc.2_Missense_Mutation_p.M1259T|MYO7A_uc009yus.1_RNA|MYO7A_uc009yut.1_Missense_Mutation_p.M470T NM_000260 NP_000251 Q13402 MYO7A_HUMAN myosin VIIA isoform 1 1259 FERM 1. actin filament-based movement|equilibrioception|lysosome organization|sensory perception of sound|visual perception cytosol|lysosomal membrane|myosin complex|photoreceptor inner segment|photoreceptor outer segment|synapse actin binding|ATP binding|calmodulin binding|microfilament motor activity ovary(3)|breast(1) 4 AAGCCAATCATGTTGCCCGTG 0.597 16 | C12orf35 55196 broad.mit.edu 37 12 32135884 32135884 + Missense_Mutation SNP C G G TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr12:32135884C>G uc001rks.2 + 4 2409 c.1995C>G c.(1993-1995)GAC>GAG p.D665E NM_018169 NP_060639 Q9HCM1 CL035_HUMAN hypothetical protein LOC55196 665 ovary(1)|skin(1) 2 all_cancers(9;3.36e-11)|all_epithelial(9;2.56e-11)|all_lung(12;5.67e-10)|Acute lymphoblastic leukemia(23;0.0122)|Lung SC(12;0.0336)|all_hematologic(23;0.0429)|Esophageal squamous(101;0.204) OV - Ovarian serous 
cystadenocarcinoma(6;0.0114) CTAAAAGTGACAGTAGCTGTT 0.423 17 | ABCD2 225 broad.mit.edu 37 12 40013182 40013182 + Missense_Mutation SNP C G G TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr12:40013182C>G uc001rmb.2 - 1 662 c.236G>C c.(235-237)GGA>GCA p.G79A NM_005164 NP_005155 Q9UBJ2 ABCD2_HUMAN ATP-binding cassette, sub-family D, member 2 79 Interaction with PEX19. fatty acid metabolic process|transport ATP-binding cassette (ABC) transporter complex|integral to plasma membrane|peroxisomal membrane ATP binding|ATPase activity|protein binding ovary(2)|upper_aerodigestive_tract(1)|pancreas(1)|central_nervous_system(1)|skin(1) 6 TGCATTCACTCCAGGCGAAGG 0.463 18 | OR6C2 341416 broad.mit.edu 37 12 55846834 55846834 + Silent SNP C T T TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr12:55846834C>T uc001sgz.1 + 1 837 c.837C>T c.(835-837)GTC>GTT p.V279V NM_054105 NP_473446 Q9NZP2 OR6C2_HUMAN olfactory receptor, family 6, subfamily C, 279 Helical; Name=7; (Potential). sensory perception of smell integral to membrane|plasma membrane olfactory receptor activity skin(2) 2 CTACTTCTGTCGCACCCTTGT 0.408 19 | LEMD3 23592 broad.mit.edu 37 12 65637180 65637180 + Missense_Mutation SNP A G G TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr12:65637180A>G uc001ssl.1 + 10 2324 c.2318A>G c.(2317-2319)GAT>GGT p.D773G LEMD3_uc009zqo.1_Missense_Mutation_p.D772G NM_014319 NP_055134 Q9Y2U8 MAN1_HUMAN LEM domain containing 3 773 Interaction with SMAD1, SMAD2, SMAD3 and SMAD5. 
negative regulation of activin receptor signaling pathway|negative regulation of BMP signaling pathway|negative regulation of transforming growth factor beta receptor signaling pathway integral to nuclear inner membrane|membrane fraction DNA binding|nucleotide binding|protein binding central_nervous_system(3)|ovary(1) 4 LUAD - Lung adenocarcinoma(6;0.0234)|LUSC - Lung squamous cell carcinoma(43;0.0975) GBM - Glioblastoma multiforme(28;0.0104) TTTCATTTAGATAGAAGAAAT 0.279 20 | IKBIP 121457 broad.mit.edu 37 12 99007867 99007867 + Silent SNP T C C TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr12:99007867T>C uc001tfv.2 - 3 659 c.549A>G c.(547-549)TCA>TCG p.S183S IKBIP_uc001tfw.2_3'UTR NM_201612 NP_963906 Q70UQ0 IKIP_HUMAN IKK interacting protein isoform 2 183 induction of apoptosis|response to X-ray endoplasmic reticulum membrane|integral to membrane protein binding 0 TTACTAAACCTGAAATCCGTC 0.308 21 | ACADS 35 broad.mit.edu 37 12 121176677 121176677 + Missense_Mutation SNP C T T rs140853839 byFrequency TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr12:121176677C>T uc001tza.3 + 8 1106 c.988C>T c.(988-990)CGC>TGC p.R330C ACADS_uc010szl.1_Missense_Mutation_p.R326C|ACADS_uc001tzb.3_Missense_Mutation_p.R211C NM_000017 NP_000008 P16219 ACADS_HUMAN short-chain acyl-CoA dehydrogenase precursor 330 mitochondrial matrix butyryl-CoA dehydrogenase activity central_nervous_system(2) 2 all_neural(191;0.0684)|Medulloblastoma(191;0.0922) Lung NSC(355;0.163) NADH(DB00157) GCTGACCTGGCGCGCTGCCAT 0.637 22 | MMP14 4323 broad.mit.edu 37 14 23312494 23312494 + Silent SNP C T T TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 
77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr14:23312494C>T uc001whc.2 + 5 951 c.717C>T c.(715-717)CAC>CAT p.H239H NM_004995 NP_004986 P50281 MMP14_HUMAN matrix metalloproteinase 14 preproprotein 239 Extracellular (Potential). Zinc; catalytic. extracellular matrix|integral to plasma membrane|melanosome calcium ion binding|metalloendopeptidase activity|zinc ion binding 0 all_cancers(95;9.47e-05) GBM - Glioblastoma multiforme(265;0.00551) TGGCTGTGCACGAGCTGGGCC 0.602 23 | TMC7 79905 broad.mit.edu 37 16 19073157 19073157 + Missense_Mutation SNP A T T TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr16:19073157A>T uc002dfq.2 + 16 2294 c.2164A>T c.(2164-2166)AGG>TGG p.R722W TMC7_uc010vap.1_Missense_Mutation_p.R612W NM_024847 NP_079123 Q7Z402 TMC7_HUMAN transmembrane channel-like 7 isoform a 722 Cytoplasmic (Potential). integral to membrane skin(2)|ovary(1) 3 AAGGGACATGAGGAACTAACT 0.418 24 | ULK2 9706 broad.mit.edu 37 17 19699577 19699577 + Missense_Mutation SNP T G G TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr17:19699577T>G uc002gwm.3 - 19 2337 c.1828A>C c.(1828-1830)ATC>CTC p.I610L ULK2_uc002gwn.2_Missense_Mutation_p.I610L NM_001142610 NP_001136082 Q8IYT8 ULK2_HUMAN unc-51-like kinase 2 610 signal transduction ATP binding|protein binding|protein serine/threonine kinase activity skin(2)|large_intestine(1)|stomach(1) 4 all_cancers(12;4.97e-05)|all_epithelial(12;0.00362)|Breast(13;0.186) GTTTTAGGGATTTTGAAAGGA 0.413 25 | CNTNAP1 8506 broad.mit.edu 37 17 40847561 40847561 + Silent SNP G A A TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr17:40847561G>A uc002iay.2 + 19 3231 c.3015G>A 
c.(3013-3015)CCG>CCA p.P1005P CNTNAP1_uc010wgs.1_RNA NM_003632 NP_003623 P78357 CNTP1_HUMAN contactin associated protein 1 precursor 1005 Extracellular (Potential). axon guidance|cell adhesion paranode region of axon receptor activity|receptor binding|SH3 domain binding|SH3/SH2 adaptor activity ovary(3)|breast(3)|upper_aerodigestive_tract(1)|lung(1) 8 Breast(137;0.000143) BRCA - Breast invasive adenocarcinoma(366;0.143) TCTTTGAGCCGGGCACCTGGA 0.567 26 | TBCD 6904 broad.mit.edu 37 17 80842049 80842049 + Nonsense_Mutation SNP C T T TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr17:80842049C>T uc002kfz.2 + 15 1634 c.1504C>T c.(1504-1506)CGA>TGA p.R502* TBCD_uc002kfx.1_Nonsense_Mutation_p.R485*|TBCD_uc002kfy.1_Nonsense_Mutation_p.R502* NM_005993 NP_005984 Q9BTW9 TBCD_HUMAN beta-tubulin cofactor D 502 'de novo' posttranslational protein folding|adherens junction assembly|negative regulation of cell-substrate adhesion|negative regulation of microtubule polymerization|post-chaperonin tubulin folding pathway|tight junction assembly adherens junction|cytoplasm|lateral plasma membrane|microtubule|tight junction beta-tubulin binding|chaperone binding|GTPase activator activity 0 Breast(20;0.000523)|all_neural(118;0.0779) all_cancers(8;0.0266)|all_epithelial(8;0.0696) OV - Ovarian serous cystadenocarcinoma(97;0.0868)|BRCA - Breast invasive adenocarcinoma(99;0.18) GGTGTTTGACCGAGACATAAA 0.443 27 | ZNF492 57615 broad.mit.edu 37 19 22846757 22846757 + Nonsense_Mutation SNP G T T rs112130958 TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr19:22846757G>T uc002nqw.3 + 4 530 c.286G>T c.(286-288)GAA>TAA p.E96* NM_020855 NP_065906 Q9P255 ZN492_HUMAN zinc finger protein 492 96 regulation of transcription, DNA-dependent|transcription, 
DNA-dependent nucleus DNA binding|zinc ion binding 0 all_cancers(12;0.0266)|all_lung(12;0.00187)|Lung NSC(12;0.0019)|all_epithelial(12;0.00203)|Hepatocellular(1079;0.244) GGTGCACAAAGAATGTTACAA 0.299 28 | CEACAM5 1048 broad.mit.edu 37 19 42224052 42224052 + Missense_Mutation SNP G A A rs138799075 byFrequency TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr19:42224052G>A uc002ork.2 + 7 1817 c.1696G>A c.(1696-1698)GCA>ACA p.A566T CEACAM5_uc002orj.1_Missense_Mutation_p.A565T|CEACAM5_uc002orl.2_Missense_Mutation_p.A566T NM_004363 NP_004354 P06731 CEAM5_HUMAN carcinoembryonic antigen-related cell adhesion 566 Ig-like 6. anchored to membrane|basolateral plasma membrane|integral to plasma membrane skin(2) 2 OV - Ovarian serous cystadenocarcinoma(3;0.00278)|all cancers(3;0.00625)|Epithelial(262;0.0379)|GBM - Glioblastoma multiforme(1328;0.142) AAGAAATGACGCAAGAGCCTA 0.522 29 | KLK11 11012 broad.mit.edu 37 19 51528895 51528895 + Missense_Mutation SNP A G G TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr19:51528895A>G uc002pvd.1 - 2 201 c.89T>C c.(88-90)CTC>CCC p.L30P KLK11_uc002pvb.1_5'UTR|KLK11_uc002pve.1_5'UTR|KLK11_uc002pvf.1_5'UTR|KLK11_uc002pvc.3_5'UTR|KLK11_uc010eom.2_5'UTR NM_144947 NP_659196 Q9UBX7 KLK11_HUMAN kallikrein 11 isoform 2 precursor 30 proteolysis extracellular region serine-type endopeptidase activity 0 all_neural(266;0.026) OV - Ovarian serous cystadenocarcinoma(262;0.00327)|GBM - Glioblastoma multiforme(134;0.00878) CATGGCCTGGAGGGGGGAGGA 0.627 30 | LILRB2 10288 broad.mit.edu 37 19 54783717 54783717 + Missense_Mutation SNP C T T rs145209585 byFrequency TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 
77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr19:54783717C>T uc002qfb.2 - 4 550 c.284G>A c.(283-285)CGA>CAA p.R95Q LILRA6_uc002qew.1_Intron|LILRB2_uc010eri.2_Missense_Mutation_p.R95Q|LILRB2_uc010erj.2_RNA|LILRB2_uc002qfc.2_Missense_Mutation_p.R95Q|LILRB2_uc010yet.1_5'UTR|LILRB2_uc010yeu.1_RNA NM_005874 NP_005865 Q8N423 LIRB2_HUMAN leukocyte immunoglobulin-like receptor, 95 Extracellular (Potential).|Ig-like C2-type 1. cell surface receptor linked signaling pathway|cell-cell signaling|cellular defense response|immune response|regulation of immune response integral to plasma membrane|membrane fraction receptor activity skin(1) 1 Ovarian(34;0.19) GBM - Glioblastoma multiforme(193;0.105) ACAGCCATATCGCCCTGTGTG 0.557 31 | HEATR5B 54497 broad.mit.edu 37 2 37295836 37295836 + Missense_Mutation SNP T C C TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr2:37295836T>C uc002rpp.1 - 8 1261 c.1165A>G c.(1165-1167)ATG>GTG p.M389V NM_019024 NP_061897 Q9P2D3 HTR5B_HUMAN HEAT repeat containing 5B 389 binding ovary(5)|skin(2)|breast(1) 8 all_hematologic(82;0.21) ACGGCTTTCATTTGTTTTCCA 0.353 32 | EIF5B 9669 broad.mit.edu 37 2 99977775 99977777 + In_Frame_Del DEL TGA - - TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr2:99977775_99977777delTGA uc002tab.2 + 4 595_597 c.411_413delTGA c.(409-414)AGTGAT>AGT p.D142del NM_015904 NP_056988 O60841 IF2P_HUMAN eukaryotic translation initiation factor 5B 142 Poly-Asp. 
regulation of translational initiation cytosol GTP binding|GTPase activity|protein binding|translation initiation factor activity ovary(2)|pancreas(1) 3 ACTCTGGGAGTGATGATGATGAT 0.345 33 | KIF5C 3800 broad.mit.edu 37 2 149793797 149793797 + Splice_Site SNP G A A TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr2:149793797G>A uc010zbu.1 + 4 660 c.292_splice c.e4-1 p.G98_splice NM_004522 NP_004513 O60282 KIF5C_HUMAN kinesin family member 5C microtubule-based movement|organelle organization cytoplasm|kinesin complex|microtubule ATP binding|microtubule motor activity skin(1) 1 BRCA - Breast invasive adenocarcinoma(221;0.108) TCGCCCACTAGGGGAAGCTGC 0.512 34 | SIRPG 55423 broad.mit.edu 37 20 1629729 1629729 + Missense_Mutation SNP C A A TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr20:1629729C>A uc002wfm.1 - 2 464 c.399G>T c.(397-399)AAG>AAT p.K133N SIRPG_uc002wfn.1_Missense_Mutation_p.K133N|SIRPG_uc002wfo.1_Missense_Mutation_p.K133N NM_018556 NP_061026 Q9P1W8 SIRPG_HUMAN signal-regulatory protein gamma isoform 1 133 Extracellular (Potential).|Ig-like V-type. 
blood coagulation|cell adhesion|cell junction assembly|cell-cell signaling|intracellular signal transduction|leukocyte migration|negative regulation of cell proliferation|positive regulation of cell proliferation|positive regulation of cell-cell adhesion|positive regulation of T cell activation integral to membrane|intracellular|plasma membrane protein binding ovary(1) 1 CTGGTCCAGACTTAAACTCCA 0.493 35 | SIGLEC1 6614 broad.mit.edu 37 20 3673751 3673751 + Missense_Mutation SNP T C C TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr20:3673751T>C uc002wja.2 - 14 3536 c.3536A>G c.(3535-3537)TAC>TGC p.Y1179C SIGLEC1_uc002wjb.1_5'UTR|SIGLEC1_uc002wiz.3_Missense_Mutation_p.Y1179C NM_023068 NP_075556 Q9BZZ2 SN_HUMAN sialoadhesin precursor 1179 Ig-like C2-type 12.|Extracellular (Potential). cell-cell adhesion|cell-matrix adhesion|endocytosis|inflammatory response extracellular region|integral to membrane|plasma membrane sugar binding pancreas(4)|ovary(2)|skin(2)|breast(1)|central_nervous_system(1) 10 CTCCAGGAGGTAGGTCAGGCG 0.682 36 | NTSR1 4923 broad.mit.edu 37 20 61340984 61340984 + Missense_Mutation SNP G A A TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr20:61340984G>A uc002ydf.2 + 1 796 c.425G>A c.(424-426)CGC>CAC p.R142H NM_002531 NP_002522 P30989 NTR1_HUMAN neurotensin receptor 1 142 Extracellular (Potential). 
endoplasmic reticulum|Golgi apparatus|integral to plasma membrane neurotensin receptor activity, G-protein coupled skin(2)|lung(1)|central_nervous_system(1) 4 Breast(26;3.65e-08) BRCA - Breast invasive adenocarcinoma(19;3.63e-06) GCCGGCTGCCGCGGCTACTAC 0.677 37 | TBX1 6899 broad.mit.edu 37 22 19748718 19748718 + Missense_Mutation SNP G T T TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr22:19748718G>T uc002zqb.2 + 3 454 c.325G>T c.(325-327)GGT>TGT p.G109C TBX1_uc002zqa.1_Missense_Mutation_p.G109C|TBX1_uc002zqc.2_Missense_Mutation_p.G109C NM_080646 NP_542377 O43435 TBX1_HUMAN T-box 1 isoform A 109 embryonic viscerocranium morphogenesis|heart development|parathyroid gland development|pharyngeal system development|regulation of transcription from RNA polymerase II promoter|soft palate development|thymus development nucleus protein homodimerization activity|sequence-specific DNA binding|sequence-specific DNA binding transcription factor activity ovary(1)|breast(1) 2 Colorectal(54;0.0993) all_lung(157;3.05e-06) GAAGGTGGCCGGTGTGAGCGT 0.592 38 | LZTR1 8216 broad.mit.edu 37 22 21341825 21341825 + Missense_Mutation SNP G A A TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr22:21341825G>A uc002zto.2 + 4 456 c.353G>A c.(352-354)CGT>CAT p.R118H LZTR1_uc002ztn.2_Missense_Mutation_p.R77H|LZTR1_uc011ahy.1_Missense_Mutation_p.R99H|LZTR1_uc010gsr.1_5'UTR NM_006767 NP_006758 Q8N653 LZTR1_HUMAN leucine-zipper-like transcription regulator 1 118 Kelch 1. 
anatomical structure morphogenesis sequence-specific DNA binding transcription factor activity ovary(2)|lung(2) 4 all_cancers(11;1.83e-25)|all_epithelial(7;9.19e-23)|Lung NSC(8;3.06e-15)|all_lung(8;5.05e-14)|Melanoma(16;0.000465)|Ovarian(15;0.0028)|Colorectal(54;0.0332)|all_neural(72;0.142) Lung SC(17;0.0262) LUSC - Lung squamous cell carcinoma(15;0.000204)|Lung(15;0.00494)|Epithelial(17;0.195) CCGGCCCCCCGTTACCACCAC 0.662 39 | TFIP11 24144 broad.mit.edu 37 22 26890269 26890269 + Missense_Mutation SNP A C C TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr22:26890269A>C uc003acr.2 - 13 2368 c.1994T>G c.(1993-1995)GTG>GGG p.V665G TFIP11_uc003acq.2_Missense_Mutation_p.V24G|TFIP11_uc003acs.2_Missense_Mutation_p.V665G|TFIP11_uc003act.2_Missense_Mutation_p.V665G|uc003acu.1_RNA NM_012143 NP_036275 Q9UBB9 TFP11_HUMAN tuftelin interacting protein 11 665 biomineral tissue development catalytic step 2 spliceosome|cytoplasm|nuclear speck DNA binding|sequence-specific DNA binding transcription factor activity 0 AGAGCACAGCACCTGCCAAAA 0.463 40 | NEFH 4744 broad.mit.edu 37 22 29886360 29886360 + Missense_Mutation SNP C A A TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr22:29886360C>A uc003afo.2 + 4 2802 c.2731C>A c.(2731-2733)CCT>ACT p.P911T NEFH_uc003afp.2_5'UTR NM_021076 NP_066554 P12036 NFH_HUMAN neurofilament, heavy polypeptide 200kDa 917 Tail. 
cell death|nervous system development neurofilament 0 GAAGGAGGCTCCTGCCAAGGT 0.502 41 | DEPDC5 9681 broad.mit.edu 37 22 32275577 32275577 + Missense_Mutation SNP G A A TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr22:32275577G>A uc003als.2 + 37 3921 c.3779G>A c.(3778-3780)CGC>CAC p.R1260H DEPDC5_uc011als.1_Missense_Mutation_p.R1191H|DEPDC5_uc011alu.1_Missense_Mutation_p.R1291H|DEPDC5_uc011alv.1_RNA|DEPDC5_uc003alt.2_Missense_Mutation_p.R1282H|DEPDC5_uc003alu.2_Missense_Mutation_p.R709H|DEPDC5_uc003alv.2_RNA|DEPDC5_uc003alw.2_Missense_Mutation_p.R558H|DEPDC5_uc011alx.1_Missense_Mutation_p.R108H|DEPDC5_uc010gwk.2_Missense_Mutation_p.R286H|DEPDC5_uc011aly.1_Missense_Mutation_p.R108H NM_014662 NP_055477 O75140 DEPD5_HUMAN DEP domain containing 5 isoform 1 1260 intracellular signal transduction ovary(4)|central_nervous_system(3)|pancreas(1) 8 AGCTTCCAGCGCAAGTGGTTT 0.607 42 | STXBP5L 9515 broad.mit.edu 37 3 120871386 120871386 + Silent SNP A G G TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr3:120871386A>G uc003eec.3 + 8 872 c.732A>G c.(730-732)GAA>GAG p.E244E STXBP5L_uc011bji.1_Silent_p.E244E NM_014980 NP_055795 Q9Y2K9 STB5L_HUMAN syntaxin binding protein 5-like 244 WD 4. 
exocytosis|protein transport cytoplasm|integral to membrane|plasma membrane ovary(7)|skin(2) 9 GBM - Glioblastoma multiforme(114;0.0694) AAAGAGCAGAACTGAGAGTTT 0.333 43 | PEX5L 51555 broad.mit.edu 37 3 179616029 179616029 + Frame_Shift_Del DEL T - - TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr3:179616029delT uc003fki.1 - 3 229 c.99delA c.(97-99)AAAfs p.K33fs PEX5L_uc011bqd.1_5'UTR|PEX5L_uc011bqe.1_Intron|PEX5L_uc011bqf.1_5'UTR|PEX5L_uc003fkj.1_Intron|PEX5L_uc010hxd.1_Frame_Shift_Del_p.K31fs|PEX5L_uc011bqg.1_Frame_Shift_Del_p.K9fs|PEX5L_uc011bqh.1_Intron NM_016559 NP_057643 Q8IYB4 PEX5R_HUMAN peroxisomal biogenesis factor 5-like 33 protein import into peroxisome matrix|regulation of cAMP-mediated signaling cytosol|peroxisomal membrane peroxisome matrix targeting signal-1 binding ovary(3)|large_intestine(1) 4 all_cancers(143;3.94e-14)|Ovarian(172;0.0338)|Breast(254;0.183) OV - Ovarian serous cystadenocarcinoma(80;1.75e-26)|GBM - Glioblastoma multiforme(14;0.000518) CCCTAGAGCCTTTTCCCTATA 0.413 44 | C3orf59 151963 broad.mit.edu 37 3 192517421 192517421 + Missense_Mutation SNP T C C rs117555490 by1000genomes TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr3:192517421T>C uc011bsp.1 - 2 551 c.230A>G c.(229-231)GAC>GGC p.D77G NM_178496 NP_848591 Q8IYB1 M21D2_HUMAN hypothetical protein LOC151963 77 0 all_cancers(143;1.56e-08)|Ovarian(172;0.0634) OV - Ovarian serous cystadenocarcinoma(49;2.8e-18)|LUSC - Lung squamous cell carcinoma(58;8.04e-06)|Lung(62;8.62e-06) GBM - Glioblastoma multiforme(46;3.86e-05) AAGCTTTTGGTCCAGCTTTTG 0.443 45 | NKX3-2 579 broad.mit.edu 37 4 13546023 13546023 + Missense_Mutation SNP C T T TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 
0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr4:13546023C>T uc003gmx.2 - 1 92 c.16G>A c.(16-18)GCC>ACC p.A6T NM_001189 NP_001180 P78367 NKX32_HUMAN NK3 homeobox 2 6 negative regulation of chondrocyte differentiation|transcription from RNA polymerase II promoter nucleus sequence-specific DNA binding|sequence-specific DNA binding transcription factor activity 0 AAGGTGTTGGCGCCGCGCACA 0.557 46 | GEMIN5 25929 broad.mit.edu 37 5 154275813 154275813 + Missense_Mutation SNP G C C TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr5:154275813G>C uc003lvx.3 - 24 3519 c.3436C>G c.(3436-3438)CAC>GAC p.H1146D GEMIN5_uc011ddk.1_Missense_Mutation_p.H1145D NM_015465 NP_056280 Q8TEQ6 GEMI5_HUMAN gemin 5 1146 ncRNA metabolic process|protein complex assembly|spliceosomal snRNP assembly Cajal body|cytosol|spliceosomal complex protein binding|snRNA binding skin(2)|ovary(1) 3 Renal(175;0.00488) Medulloblastoma(196;0.0354)|all_neural(177;0.147) KIRC - Kidney renal clear cell carcinoma(527;0.00112) TTCCAAGTGTGGTAAGAGGAG 0.547 47 | AGXT2L2 85007 broad.mit.edu 37 5 177649920 177649920 + Missense_Mutation SNP C T T rs142142484 byFrequency TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr5:177649920C>T uc003miz.2 - 7 886 c.634G>A c.(634-636)GCT>ACT p.A212T AGXT2L2_uc003miy.2_5'UTR|AGXT2L2_uc003mjc.2_Missense_Mutation_p.A171T|AGXT2L2_uc003mja.2_RNA|AGXT2L2_uc003mjb.2_5'UTR|AGXT2L2_uc003mjd.1_Missense_Mutation_p.A70T NM_153373 NP_699204 Q8IUZ5 AT2L2_HUMAN alanine-glyoxylate aminotransferase 2-like 2 212 mitochondrion pyridoxal phosphate binding|transaminase activity pancreas(1) 1 all_cancers(89;0.00185)|Renal(175;0.000269)|Lung NSC(126;0.00858)|all_lung(126;0.0139) 
all_neural(177;0.00802)|Medulloblastoma(196;0.0145)|all_hematologic(541;0.248) Kidney(164;2.23e-05)|KIRC - Kidney renal clear cell carcinoma(164;0.000178) GBM - Glioblastoma multiforme(465;0.181)|all cancers(165;0.235) L-Alanine(DB00160)|Pyridoxal Phosphate(DB00114) AGAGACTCAGCGAAGAAGGCT 0.587 48 | BMP6 654 broad.mit.edu 37 6 7727630 7727630 + Missense_Mutation SNP G A A TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr6:7727630G>A uc003mxu.3 + 1 620 c.442G>A c.(442-444)GAC>AAC p.D148N NM_001718 NP_001709 P22004 BMP6_HUMAN bone morphogenetic protein 6 preproprotein 148 BMP signaling pathway|cartilage development|growth|immune response|positive regulation of aldosterone biosynthetic process|positive regulation of bone mineralization|positive regulation of osteoblast differentiation|positive regulation of pathway-restricted SMAD protein phosphorylation|positive regulation of transcription from RNA polymerase II promoter|SMAD protein signal transduction extracellular space BMP receptor binding|cytokine activity|growth factor activity|protein heterodimerization activity large_intestine(2)|ovary(1) 3 Ovarian(93;0.0721) CGCCGACAACGACGAGGACGG 0.682 49 | NFKBIE 4794 broad.mit.edu 37 6 44229437 44229437 + Missense_Mutation SNP C A A TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr6:44229437C>A uc003oxe.1 - 3 1059 c.1034G>T c.(1033-1035)TGC>TTC p.C345F NM_004556 NP_004547 O00221 IKBE_HUMAN nuclear factor of kappa light polypeptide gene 345 ANK 3. 
cytoplasmic sequestering of transcription factor protein binding breast(2) 2 all_cancers(18;2e-05)|all_lung(25;0.00747)|Hepatocellular(11;0.00908)|Ovarian(13;0.0273) Colorectal(64;0.00337)|COAD - Colon adenocarcinoma(64;0.00536) TTCCAGCAGGCAGCGGGCACA 0.632 50 | COL19A1 1310 broad.mit.edu 37 6 70589454 70589454 + Translation_Start_Site SNP G T T TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr6:70589454G>T uc003pfc.1 + 2 112 c.-5G>T c.(-7--3)AAGGC>AATGC NM_001858 NP_001849 Q14993 COJA1_HUMAN alpha 1 type XIX collagen precursor cell differentiation|cell-cell adhesion|extracellular matrix organization|skeletal system development collagen extracellular matrix structural constituent|protein binding, bridging ovary(2)|breast(2) 4 ATGGTTTCAAGGCACAATGAG 0.418 51 | RFPL4B 442247 broad.mit.edu 37 6 112671523 112671523 + Missense_Mutation SNP C T T rs143103700 byFrequency TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr6:112671523C>T uc003pvx.1 + 3 925 c.613C>T c.(613-615)CGC>TGC p.R205C NM_001013734 NP_001013756 Q6ZWI9 RFPLB_HUMAN ret finger protein-like 4B 205 B30.2/SPRY. 
zinc ion binding 0 all_cancers(87;9.44e-05)|all_hematologic(75;0.000114)|all_epithelial(87;0.00265)|Colorectal(196;0.0209) all cancers(137;0.0202)|OV - Ovarian serous cystadenocarcinoma(136;0.0477)|Epithelial(106;0.0646)|GBM - Glioblastoma multiforme(226;0.0866)|BRCA - Breast invasive adenocarcinoma(108;0.244) CCCTCGCCTTCGCCGTGTGGG 0.448 52 | DSE 29940 broad.mit.edu 37 6 116757341 116757341 + Missense_Mutation SNP C A A TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr6:116757341C>A uc003pws.2 + 6 1904 c.1710C>A c.(1708-1710)GAC>GAA p.D570E DSE_uc011ebg.1_Missense_Mutation_p.D589E|DSE_uc003pwt.2_Missense_Mutation_p.D570E|DSE_uc003pwu.2_Missense_Mutation_p.D237E NM_001080976 NP_001074445 Q9UL01 DSE_HUMAN dermatan sulfate epimerase precursor 570 dermatan sulfate biosynthetic process endoplasmic reticulum|Golgi apparatus|integral to membrane chondroitin-glucuronate 5-epimerase activity ovary(1) 1 all_cancers(87;0.00019)|all_epithelial(87;0.000416)|Ovarian(999;0.133)|Colorectal(196;0.234) Epithelial(106;0.00915)|OV - Ovarian serous cystadenocarcinoma(136;0.0149)|GBM - Glioblastoma multiforme(226;0.0189)|all cancers(137;0.0262) TCCTTGTAGACCAAATACACC 0.502 53 | CLIP2 7461 broad.mit.edu 37 7 73771699 73771699 + Silent SNP G A A TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr7:73771699G>A uc003uam.2 + 6 1434 c.1107G>A c.(1105-1107)GAG>GAA p.E369E CLIP2_uc003uan.2_Silent_p.E369E NM_003388 NP_003379 Q9UDT6 CLIP2_HUMAN CAP-GLY domain containing linker protein 2 369 Potential. 
microtubule associated complex skin(3) 3 AGCACATTGAGCAGCTGCTGG 0.617
54 | PRUNE2 158471 broad.mit.edu 37 9 79321219 79321219 + Missense_Mutation SNP C G G TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr9:79321219C>G uc010mpk.2 - 8 6095 c.5971G>C c.(5971-5973)GAA>CAA p.E1991Q PRUNE2_uc004akj.3_5'Flank|PRUNE2_uc010mpl.1_5'Flank NM_015225 NP_056040 Q8WUY3 PRUN2_HUMAN prune homolog 2 1991 apoptosis|G1 phase|induction of apoptosis cytoplasm metal ion binding|pyrophosphatase activity 0 TCTTGACCTTCATTAGTTGAA 0.423
55 | DAPK1 1612 broad.mit.edu 37 9 90266587 90266587 + Missense_Mutation SNP C T T rs36214022 TCGA-32-4209-01A-01D-1353-08 TCGA-32-4209-10A-01D-1353-08 Somatic Phase_I Capture Illumina GAIIx 0c30ef40-b943-4281-84d7-8d574882abd4 77169aad-6bd8-4b1b-bb48-c02960d41ea0 g.chr9:90266587C>T uc004apc.2 + 17 1910 c.1772C>T c.(1771-1773)CCT>CTT p.P591L DAPK1_uc004apd.2_Missense_Mutation_p.P591L|DAPK1_uc011ltg.1_Missense_Mutation_p.P591L|DAPK1_uc011lth.1_Missense_Mutation_p.P328L|DAPK1_uc004apf.1_Missense_Mutation_p.P145L NM_004938 NP_004929 P53355 DAPK1_HUMAN death-associated protein kinase 1 591 ANK 7. P -> L. apoptosis|induction of apoptosis by extracellular signals|intracellular protein kinase cascade actin cytoskeleton|cytoplasm ATP binding|calmodulin binding|protein serine/threonine kinase activity ovary(1)|breast(1) 2 GGCAACATGCCTATCGTGGTG 0.498 Chronic_Lymphocytic_Leukemia_Familial_Clustering_of
56 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # STAT115 Homeworks
2 |
3 |
4 |
--------------------------------------------------------------------------------