├── .gitignore
├── HW1
    ├── Homework1.Rmd
    ├── Homework1.html
    └── README.md
├── HW2
    ├── code
    │   └── Homework2.Rmd
    ├── data
    │   ├── part3_counts.rds
    │   ├── part5
    │   │   ├── .Rhistory
    │   │   ├── A.geneBodyCoverage.r
    │   │   ├── B.geneBodyCoverage.r
    │   │   ├── C.geneBodyCoverage.r
    │   │   ├── D.geneBodyCoverage.r
    │   │   ├── E.geneBodyCoverage.r
    │   │   ├── F.geneBodyCoverage.r
    │   │   ├── G.geneBodyCoverage.r
    │   │   ├── H.geneBodyCoverage.r
    │   │   ├── J.geneBodyCoverage.r
    │   │   ├── K.geneBodyCoverage.r
    │   │   ├── L.geneBodyCoverage.r
    │   │   └── M.geneBodyCoverage.r
    │   └── part5_example.pdf
    └── papers
    │   ├── part3_4-manuscript.pdf
    │   └── part6
    │       ├── 1_original.pdf
    │       ├── 2_Hartl_response.pdf
    │       └── 3_response_to_Hartl.pdf
├── HW3
    ├── Homework3_release.Rmd
    └── q1_data
    │   ├── BRCA_phenotype.txt
    │   ├── BRCA_zscore_data.txt
    │   ├── diagnosis.txt
    │   └── unknown_samples.txt
├── HW4
    ├── README.md
    └── Stat115_Homework4.Rmd
├── HW5
    ├── code
    │   └── STAT115_HW5_2020.Rmd
    ├── data
    │   └── HW5_ESC.Dixon_2015.DI.chr21.txt
    └── papers
    │   ├── PMID23001124.pdf
    │   ├── Supplement_10.1038_nature11082.pdf
    │   └── nejmoa2002032.pdf
├── HW6
    ├── README.md
    ├── Stat115_Homework6.Rmd
    └── data
    │   ├── GBM_clin.txt
    │   ├── GBM_expr.txt
    │   ├── GBM_meth.txt
    │   ├── TCGA-02-2483.maf.txt
    │   ├── TCGA-06-0124.maf.txt
    │   ├── TCGA-06-0128.maf.txt
    │   ├── TCGA-06-0129.maf.txt
    │   ├── TCGA-06-0210.maf.txt
    │   ├── TCGA-06-2570.maf.txt
    │   ├── TCGA-06-5410.maf.txt
    │   ├── TCGA-06-5412.maf.txt
    │   ├── TCGA-06-5417.maf.txt
    │   ├── TCGA-06-6389.maf.txt
    │   ├── TCGA-14-1456.maf.txt
    │   ├── TCGA-14-4157.maf.txt
    │   ├── TCGA-19-1790.maf.txt
    │   ├── TCGA-19-2629.maf.txt
    │   ├── TCGA-26-1442.maf.txt
    │   ├── TCGA-26-5133.maf.txt
    │   ├── TCGA-27-2521.maf.txt
    │   ├── TCGA-28-5209.maf.txt
    │   ├── TCGA-28-5218.maf.txt
    │   ├── TCGA-32-4208.maf.txt
    │   ├── TCGA-32-4209.maf.txt
    │   ├── TCGA-32-4213.maf.txt
    │   └── TCGA-41-3393.maf.txt
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
1 | *.Rhistory
2 | 


--------------------------------------------------------------------------------
/HW1/Homework1.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "STAT115 Homework 1"
  3 | author: ""
  4 | date: "February 10, 2020"
  5 | output: html_document
  6 | ---
  7 | 
  8 | ```{r setup, include=FALSE}
  9 | knitr::opts_chunk$set(echo = TRUE, eval = TRUE)
 10 | ```
 11 | 
 12 | 
 13 | # Part 0: Odyssey Signup
 14 | 
 15 | Please fill out the Odyssey survey on Canvas so we can create an account for you.
 16 | 
 17 | # Part I: Introduction to R
 18 | 
 19 | ## Problem 1: Installation
 20 | 
 21 | **Please install the following R/Bioconductor packages**
 22 | 
 23 | ```{r install, eval = FALSE}
 24 | if (!requireNamespace("BiocManager", quietly = TRUE))
 25 |     install.packages("BiocManager")
 26 | BiocManager::install()
 27 | BiocManager::install("sva")
 28 | 
 29 | install.packages(c("ggplot2", "dplyr", "tidyr", "HistData", "mvtnorm",
 30 |                    "reticulate"))
 31 | ```
 32 | 
 33 | 
 34 | Please run this command (use `eval=TRUE`) to check that Bioconductor is working correctly.
 35 | 
 36 | ```{r, eval=FALSE}
 37 | BiocManager::valid() 
 38 | ```
 39 | 
 40 | 
 41 | ```{r libraries, message = FALSE}
 42 | # these packages are needed for HW2
 43 | # affy and affyPLM are needed to read the microarray data and run RMA
 44 | library(sva) # for batch effect correction. Contains ComBat and sva.
 45 | library(ggplot2) # for plotting
 46 | library(dplyr) # for data manipulation
 47 | library(reticulate) # needed to run python in Rstudio
 48 | # these next two are not essential to this course
 49 | library(mvtnorm) # need this to simulate data from multivariate normal
 50 | library(HistData) # need this for data
 51 | ```
 52 | 
 53 | 
 54 | ## Problem 2: Getting help
 55 | 
 56 | You can use the `mean()` function to compute the mean of a vector like
 57 | so:
 58 | 
 59 | ```{r mean}
 60 | x1 <- c(1:10, 50)
 61 | mean(x1)
 62 | ```
 63 | 
 64 | However, this does not work if the vector contains NAs:
 65 | 
 66 | ```{r mean-na}
 67 | x1_na <- c(1:10, 50, NA)
 68 | mean(x1_na)
 69 | ```
 70 | 
 71 | **Please use R documentation to find the mean after excluding NA's (hint: `?mean`)**
 72 | 
 73 | ```{r problem2}
 74 | # your code here
 75 | ```
 76 | 
 77 | 
 78 | Grading: Grade on correctness.
 79 | 
 80 | + 0.5pt
 81 | 
 82 | # Part II: Data Manipulation
 83 | 
 84 | ## Problem 3: Basic Selection
 85 | 
 86 | In this question, we will practice data manipulation using a dataset
 87 | collected by Francis Galton in 1886 on the heights of parents and their
 88 | children. This is a very famous dataset, and Galton used it to come up
 89 | with regression and correlation.
 90 | 
 91 | The data is available as `GaltonFamilies` in the `HistData` package.
 92 | Here, we load the data and show the first few rows. To find out more
 93 | information about the dataset, use `?GaltonFamilies`.
 94 | 
 95 | ```{r loadGalton}
 96 | data(GaltonFamilies)
 97 | head(GaltonFamilies)
 98 | ```
 99 | 
100 | a. **Please report the height of the 10th child in the dataset.**
101 | 
102 | ```{r problem3a}
103 | # your code here
104 | ```
105 | 
106 | b. **What is the breakdown of male and female children in the dataset?**
107 | 
108 | ```{r problem3b}
109 | # your code here
110 | ```
111 | 
112 | c. **How many observations (number of rows) are in Galton's dataset? Please answer this
113 | question without consulting the R help.**
114 | 
115 | ```{r problem3c}
116 | # your code here
117 | ```
118 | 
119 | d. **What is the mean height for the 1st child in each family?**
120 | 
121 | ```{r problem3d}
122 | # your code here
123 | ```
124 | 
125 | e. **Create a table showing the mean height for male and female children.**
126 | ```{r problem3e}
127 | # your code here
128 | ```
129 | 
130 | f. **What was the average number of children each family had?**
131 | 
132 | ```{r problem3f}
133 | # your code here
134 | ```
135 | 
136 | g. **Convert the children's heights from inches to centimeters and store
137 | it in a column called `childHeight_cm` in the `GaltonFamilies` dataset.
138 | Show the first few rows of this dataset.**
139 | 
140 | ```{r problem3g}
141 | # your code here
142 | ```
143 | 
144 | 
145 | 
146 | ## Problem 4: Spurious Correlation
147 | 
148 | ```{r gen-data-spurious, cache = TRUE, eval=TRUE}
149 | # set seed for reproducibility
150 | set.seed(1234)
151 | N <- 25
152 | ngroups <- 100000
153 | sim_data <- data.frame(group = rep(1:ngroups, each = N),
154 |                        X = rnorm(N * ngroups),
155 |                        Y = rnorm(N * ngroups))
156 | ```
157 | 
158 | In the code above, we generate `r ngroups` groups of `r N` observations
159 | each. In each group, we have X and Y, where X and Y are independent
160 | normally distributed data and have 0 correlation.
161 | 
162 | a. **Find the correlation between X and Y for each group, and display
163 | the highest correlations.**
164 | 
165 | Hint: since the data is quite large and your code might take a few
166 | moments to run, you can test your code on a subset of the data first
167 | (e.g. you can take the first 100 groups like so):
168 | 
169 | ```{r subset}
170 | ```
171 | 
172 | In general, this is good practice whenever you have a large dataset:
173 | If you are writing new code and it takes a while to run on the whole
174 | dataset, get it to work on a subset first. By running on a subset, you
175 | can iterate faster.
176 | 
177 | However, please do run your final code on the whole dataset.
178 | 
179 | ```{r cor, cache = TRUE}
180 | # your code here
181 | ```
182 | 
183 | b. **The highest correlation is around 0.8. Can you explain why we see
184 | such a high correlation when X and Y are supposed to be independent and
185 | thus uncorrelated?**
186 | 
187 | Because we cherrypicked the highest correlations among 100,000
188 | correlations, it is just by chance that we found a few with such
189 | a high correlation. We can see in the histogram below that most of
190 | the correlations are around the expected value of 0.
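
The selection effect described above can also be reproduced outside of R. The Python sketch below is illustrative only: it uses 2,000 groups rather than 100,000 to keep the pure-Python runtime short, and the helper names `pearson` and `simulate_group_correlations` are hypothetical, not part of the assignment.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_group_correlations(n_groups=2000, n_per_group=25, seed=1234):
    """Correlation of independent standard-normal X and Y within each group."""
    rng = random.Random(seed)
    cors = []
    for _ in range(n_groups):
        xs = [rng.gauss(0, 1) for _ in range(n_per_group)]
        ys = [rng.gauss(0, 1) for _ in range(n_per_group)]
        cors.append(pearson(xs, ys))
    return cors


cors = simulate_group_correlations()
print("mean correlation:", sum(cors) / len(cors))
print("largest correlation:", max(cors))
```

The mean stays close to 0 while the maximum is far from it, and the maximum tends to grow as more groups are searched.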
191 | 
192 | ```{r cor-hist, eval=F}
193 | ```
194 | 
195 | 
196 | # Part III: Plotting
197 | 
198 | ## Problem 5
199 | 
200 | **Show a plot of the data for the group that had the highest correlation
201 | you found in Problem 4.**
202 | 
203 | ```{r problem5}
204 | # your code here
205 | ```
206 | 
207 | Grading: 1pt.
208 | 
209 | ## Problem 6
210 | 
211 | We generate some sample data below. The data is numeric, and has 3
212 | columns: X, Y, Z.
213 | 
214 | ```{r gen-data-corr}
215 | N <- 100
216 | Sigma <- matrix(c(1, 0.75, 0.75, 1), nrow = 2, ncol = 2) * 1.5
217 | means <- list(c(11, 3), c(9, 5), c(7, 7), c(5, 9), c(3, 11))
218 | dat <- lapply(means, function(mu)
219 |   rmvnorm(N, mu, Sigma))
220 | dat <- as.data.frame(Reduce(rbind, dat)) %>%
221 |   mutate(Z = as.character(rep(seq_along(means), each = N)))
222 | names(dat) <- c("X", "Y", "Z")
223 | ```
224 | 
225 | a. **Compute the overall correlation between X and Y.**
226 | 
227 | ```{r problem6a}
228 | # your code here
229 | ```
230 | 
231 | b. **Make a plot showing the relationship between X and Y. Comment on
232 | the correlation that you see.**
233 | 
234 | ```{r problem6b}
235 | # your code here
236 | ```
237 | 
238 | Your text answer here.
239 | 
240 | The correlation between X and Y is negative.
241 | 
242 | c. **Compute the correlations between X and Y for each level of Z.**
243 | 
244 | ```{r problem6c}
245 | # your code here
246 | ```
247 | 
248 | d. **Make a plot showing the relationship between X and Y, but this
249 | time, color the points using the value of Z. Comment on the result,
250 | especially any differences between this plot and the previous plot.**
251 | 
252 | ```{r problem6d}
253 | # your code here
254 | ```
255 | 
256 | Your text answer here.
257 | 
258 | 
259 | # Part IV: Bash practices
260 | 
261 | ## Problem 7: Bash practices on Odyssey
262 | 
263 | Please answer the following questions using bash commands and include those in
264 | your answer. Data are available at `/n/stat115/2020/HW1/public_MC3.maf`.
265 | 
266 | Mutation Annotation Format ([MAF](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/))
267 | is a tab-delimited text file with aggregated mutation information.
268 | MC3.maf (`/n/stat115/2020/HW1/public_MC3.maf`) is a curated list of [somatic mutations](https://www.britannica.com/science/somatic-mutation)
269 | that occurred in many patients with different types of cancer from TCGA.
270 | 
271 | Since a complete MAF file contains far more information than we need, 
272 | in this problem we will focus on part of it.
273 | 
274 | ```
275 | Chromosome	Start_Position	Hugo_Symbol	Variant_Classification
276 | 10	123810032	TACC2	Missense_Mutation
277 | 10	133967449	JAKMIP3	Silent
278 | 11	124489539	PANX3	Missense_Mutation
279 | 11	47380512	SPI1	Missense_Mutation
280 | 11	89868837	NAALAD2	Missense_Mutation
281 | 11	92570936	FAT3	Silent
282 | 12	107371855	MTERFD3	Missense_Mutation
283 | 12	108012011	BTBD11	Missense_Mutation
284 | 12	117768962	NOS1	5'Flank
285 | ```
286 | 
287 | In `/n/stats115/2020/HW1/MC3/public_MC3.maf`, `Chromosome` and `Start_Position`
288 | together specify the genomic location where a mutation has occurred.
289 | `Hugo_Symbol` is the gene overlapping that location, and
290 | `Variant_Classification` specifies how the mutation influences downstream biological
291 | processes, e.g. transcription and translation.
292 | 
293 | Please include your bash commands and the full output from the bash console,
294 | along with your text answers to the questions.
295 | 
296 | 
297 | a. How many lines are there in this file? How many times does the "KRAS" gene appear?
298 | 
299 | ```{r q7a, engine="bash", eval = FALSE}
300 | # your bash code here
301 | ```
302 | 
303 | ```
304 | your bash output here
305 | ```
306 | 
307 | b. How many unique `Variant_Classification` values are there in the MAF? Please
308 | count the occurrences of each type and sort them. Which one is the most frequent?
309 | 
310 | ```{r q7b, engine="bash", eval = FALSE}
311 | # your bash code here
312 | ```
313 | 
314 | ```
315 | your bash output here
316 | ```
317 | 
318 | Your text answer:
319 | 
320 | c. What are the top FIVE most frequent genes? Please provide 
321 | the bash command and equivalent Python command. If you are a PI 
322 | looking for a gene to investigate (you need to find a gene with potentially 
323 | better biological significance), which gene out of the top 5 would you 
324 | choose? Why?
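
As a rough illustration of what the Python side of this question can look like, the sketch below counts column values with `collections.Counter`. It runs on the nine-row excerpt shown earlier rather than on the real file, and `SAMPLE_MAF` and `top_values` are hypothetical names chosen for this example.

```python
import csv
import io
from collections import Counter

# Rows copied from the MAF excerpt above; the real file is much larger.
SAMPLE_MAF = (
    "Chromosome\tStart_Position\tHugo_Symbol\tVariant_Classification\n"
    "10\t123810032\tTACC2\tMissense_Mutation\n"
    "10\t133967449\tJAKMIP3\tSilent\n"
    "11\t124489539\tPANX3\tMissense_Mutation\n"
    "11\t47380512\tSPI1\tMissense_Mutation\n"
    "11\t89868837\tNAALAD2\tMissense_Mutation\n"
    "11\t92570936\tFAT3\tSilent\n"
    "12\t107371855\tMTERFD3\tMissense_Mutation\n"
    "12\t108012011\tBTBD11\tMissense_Mutation\n"
    "12\t117768962\tNOS1\t5'Flank\n"
)


def top_values(handle, column, k=5):
    """Return the k most frequent values of `column` in a tab-delimited file."""
    # QUOTE_NONE keeps quote characters inside fields as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter(row[column] for row in reader)
    return counts.most_common(k)


top_classes = top_values(io.StringIO(SAMPLE_MAF), "Variant_Classification")
top_genes = top_values(io.StringIO(SAMPLE_MAF), "Hugo_Symbol")
print(top_classes)
print(top_genes)
```

To run it on the real data, pass an open file handle instead of the `io.StringIO` wrapper.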
325 | 
326 | ```{r q7c, engine="bash", eval = FALSE}
327 | # your bash code here
328 | ```
329 | 
330 | ```
331 | your bash output here
332 | ```
333 | 
334 | Equivalent python command:
335 | 
336 | ```{r q7cpy, engine="python", eval=FALSE}
337 | # your python command here
338 | ```
339 | 
340 | ```
341 | your python output here
342 | ```
343 | 
344 | Your text answer:
345 | 
346 | 
347 | d. Write a bash program that determines whether a user-input year ([YYYY]) is
348 | a leap year. A year is a leap year if it is a multiple of four, except that a
349 | centennial year that is not divisible by 400 is not a leap year.
350 | The user input can be either positional or interactive.
351 | Please include the content of your shell script here and test it on
352 | 1900/2000/2002. Does your code run as expected?
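
For reference, the leap-year rule can be written as a short predicate. The Python sketch below only illustrates the logic and is not the requested bash script; the function name `is_leap_year` is a hypothetical choice.

```python
def is_leap_year(year):
    """Leap-year rule: divisible by 4, except centennial years not divisible by 400."""
    if year % 400 == 0:
        return True
    if year % 100 == 0:
        return False
    return year % 4 == 0


for year in (1900, 2000, 2002):
    print(year, is_leap_year(year))
```

The same three checks translate directly into bash arithmetic tests.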
353 | 
354 | ```{r q7d, engine="bash", eval = FALSE}
355 | # your bash code here
356 | ```
357 | 
358 | 
359 | 
360 | # Part V. High throughput sequencing read mapping
361 | 
362 | We will give you a simple example to test high throughput sequencing
363 | alignment for RNA-seq data. Normally for paired-end sequencing data,
364 | each sample will have two separate FASTQ files, with line-by-line
365 | correspondence to the two reads from the same fragment. Read mapping
366 | could take a long time, so we have created just two FASTQ files of one
367 | RNA-seq sample with only 3M fragments (2 * 3M reads) for you to run STAR
368 | instead of the full data. The files are located at
369 | `/n/stat115/2020/HW1`. The mapping will generate one single output
370 | file. Make sure to use the right parameters for single-end (SE) vs
371 | paired-end (PE) modes in BWA and STAR.
372 | 
373 | Please include the commands that you used to run BWA and STAR in your
374 | answers.
375 | 
376 | 
377 | ## Problem 8: BWA
378 | 
379 | 1. Use BWA (Li & Durbin, Bioinformatics 2009) to map the reads to the
380 | Hg38 version of the reference genome, available on Odyssey at
381 | `/n/stat115/HW2_2019/bwa_hg38_index/hg38.fasta`. In 
382 | `/n/stat115/HW1_2020/BWA/loop`, you are provided with three `.fastq`
383 | files with the following structure (`A_l` and `A_r` are paired sequencing reads
384 | from sample_A). Write a for loop in bash to align reads to the reference
385 | using BWA PE mode and generate output in SAM format.
386 | 
387 | How many rows are in each output `.sam` file? Use SAMtools on the output
388 | to find out how many reads are mappable and uniquely mappable
389 | (please also calculate the ratio). Please include the full samtools output
390 | and your text answer.
391 | 
392 | 
393 | ```{r 8, engine="bash", eval = FALSE}
394 | # please provide the content of your sbatch script (including the header)
395 | ```
396 | 
397 | ```
398 | samtools output
399 | ```
400 | 
401 | Your text answer
402 | 
403 | ## Problem 9: STAR alignment
404 | 
405 | 1. Use STAR (Dobin et al, Bioinformatics 2012) to map the reads to the
406 | reference genome, available on Odyssey at
407 | `/n/stat115/HW1_2020/STARIndex`. Use the paired-end alignment mode and
408 | generate the output in SAM format. Please include full STAR report.  
409 | How many reads are mappable and how many are uniquely mappable?
410 | 
411 | ```{r 9, engine="bash", eval = FALSE}
412 | # please provide the content of your sbatch script (including the header)
413 | ```
414 | 
415 | ```
416 | Log file from STAR
417 | ```
418 | Your text answer here.
419 | 
420 | 
421 | 2. If you are getting a different number of mappable fragments between
422 | BWA and STAR on the same data, why?
423 | 
424 | Your text answer here.
425 | 
426 | 
427 | # Part VI: Dynamic programming with Python
428 | 
429 | ## Problem 10 
430 | 
431 | Given a finite list of integers,
432 | write a Python script that maximizes Z, where Z is the sum of the
433 | numbers from location X to location Y on this list. Be aware, your
434 | algorithm should look at each number ONLY ONCE from left to right.
435 | Your script should return three values: the starting index location X, 
436 | the ending index location Y, and Z, the sum of numbers between index 
437 | X and Y (inclusive).
438 | 
439 | For example, if A=[-2, 1, 7, -4, 5, 2, -3, -6, 4, 3, -8, -1, 6, -7, -9, -5], 
440 | your program should return (start_index = 1, end_index = 5, sum = 11) 
441 | corresponding to [1, 7, -4, 5, 2].
442 | 
443 | Please test your program with this example and see if you can get the 
444 | correct numbers.
445 | 
446 | Hint: Consider dynamic programming.
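
One possible single-pass approach is Kadane's algorithm, sketched below under the assumption that the list is non-empty and that 0-based indices are wanted, as in the example above. The function name `max_subarray` is hypothetical, and other dynamic-programming formulations are equally valid.

```python
def max_subarray(nums):
    """Return (start_index, end_index, best_sum) of the maximum-sum contiguous run."""
    if not nums:
        raise ValueError("nums must contain at least one number")

    best_sum = cur_sum = nums[0]
    best_start = best_end = cur_start = 0

    for i in range(1, len(nums)):
        if cur_sum < 0:
            # A negative running sum can only hurt, so restart the run here.
            cur_sum = nums[i]
            cur_start = i
        else:
            cur_sum += nums[i]

        if cur_sum > best_sum:
            best_sum = cur_sum
            best_start = cur_start
            best_end = i

    return best_start, best_end, best_sum


A = [-2, 1, 7, -4, 5, 2, -3, -6, 4, 3, -8, -1, 6, -7, -9, -5]
print(max_subarray(A))
```

Each element is read once, and only the running sum and the best run so far are kept in memory.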
447 | 
448 | ```{python dynamic-programming, eval=TRUE, echo = TRUE}
449 | 
450 | ```
451 | 
452 | 


--------------------------------------------------------------------------------
/HW1/README.md:
--------------------------------------------------------------------------------
1 | # Homework-1
2 | 
3 | - Due: 02/09/2020 11:59pm
4 | 
5 | Please fill out the Odyssey signup form by Wednesday 01/29 so that you can get your account in time.
6 | Please feel free to post on Canvas Discussion if you find anything unclear. Thanks!
7 | 
8 | 


--------------------------------------------------------------------------------
/HW2/code/Homework2.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "STAT115 Homework 2"
  3 | author: ''
  4 | date: 'Due: Sunday, February 23, 2020 at 11:59pm'
  5 | output:
  6 |   html_document: default
  7 |   pdf_document: default
  8 | ---
  9 | 
 10 | ```{r setup, include=FALSE}
 11 | knitr::opts_chunk$set(echo = TRUE, eval = TRUE)# please knit with `echo=TRUE, eval=TRUE`
 12 | ```
 13 | <br><br>
 14 | 
 15 | # Part I: RNA-seq quality control
 16 | For this question, we will examine a series of tools to perform essential quality-control analyses for high-throughput RNA sequencing data. 
 17 | 
 18 | ### Problem I.1 (3pts)
 19 | **You are asked by a collaborator to analyze four RNA-seq libraries. She suspects that the libraries are generally of high quality but is concerned that a sample may have been switched with her benchmate's during processing. Execute FastQC, STAR, and RSeQC (tin.py) to determine whether any of the samples exhibit unusual quality control metrics. Overall, identify the best and worst libraries. Your answer should provide evidence from all three tools. Include screenshots and tables as necessary, as if you were delivering a report to the collaborator.**
 20 | ```
 21 | Sequencing data:
 22 | /n/stat115/2020/HW2/raw_data
 23 | 
 24 | modules: fastqc/0.11.8-fasrc01, STAR/2.6.0c-fasrc01
 25 | index: /n/stat115/2020/HW2/star_hg38_index
 26 | bed: /n/stat115/2020/HW2/hg38_RefSeq.bed 
 27 | ```
 28 | 
 29 | ```
 30 | Hint: not required but helpful to run STAR with these parameters:
 31 | --outSAMtype BAM SortedByCoordinate 
 32 | --readFilesCommand zcat
 33 | ```
 34 | 
 35 | Student response
 36 | <br>
 37 | 
 38 | ### Problem I.2 (0.5pt; graduate students only)
 39 | 
 40 | **Your collaborator recalls that one of her samples was left on the bench for a couple of days before the full RNA-seq library was processed. Using the metrics from the question above, can you identify this sample? Provide your rationale.**
 41 | 
 42 | Student response
 43 | 
 44 | -----
 45 | 
 46 | <br><br>
 47 | 
 48 | # Part II: Pseudoalignment
 49 | 
 50 | ### Problem II.1 (1pt)
 51 | **Use Salmon to process the 4 sequencing libraries introduced in the previous question. Identify the transcript and gene with the highest expression in each library from the Salmon output.**
 52 | 
 53 | ```
 54 | module: salmon/0.12.0-fasrc01
 55 | index: /n/stat115/2020/HW2/salmon_hg38_index
 56 | ```
 57 | 
 58 | Student response
 59 | <br>
 60 | 
 61 | ### Problem II.2 (1pt)
 62 | **Report the relative speed of Salmon and STAR for the analyses of these four samples. Comment on your results based on the lecture material.**
 63 | 
 64 | ```
 65 | Hint: you can parse the times from log files or use the `time` tool in the command line...
 66 | e.g. time sleep 2
 67 | ```
 68 | 
 69 | Student response
 70 | <br>
 71 | 
 72 | ### Problem II.3 (1pt; graduate students only)
 73 | **Plot the relationship between effective length, normalized read counts, TPM, and FPKM for runX from the Salmon output. Comment on the relative utility of each metric when analyzing gene expression data.** 
 74 | 
 75 | Student response
 76 | 
 77 | -----
 78 | 
 79 | <br><br>
 80 | 
 81 | # Part III: Differential expression 
 82 | 
 83 | In 2014, a controversial manuscript from Lin et al. argued that, based on RNA-seq of several tissues from both mouse and human, fundamental physiological differences existed between these two organisms. Here, we will investigate these claims for a subset of the data analyzed. (Note: a copy of this manuscript is included as the `part3_4-manuscript.pdf` file associated with this homework.) The provided data is a counts matrix of the samples with the following conventions:
 84 | ```
 85 | row: <mouse>_<human> gene name (e.g. Stag2_STAG2)
 86 | column: <organism>_<tissue>_<batch> sample identifier (e.g. human_adipose_3)
 87 | ```
 88 | 
 89 | ### Problem III.1 (1pt)
 90 | **Perform a principal component analysis of the samples using the top 5,000 most variable genes as features. Indicate the species, tissue, and batch of each sample in the plot. Do the results support the conclusions of the original paper? Do the results suggest the presence of a batch effect?**
 91 | 
 92 | ```{r import_piii, include=TRUE, echo=TRUE, eval = TRUE}
 93 | 
 94 | # Import processed raw counts
 95 | counts <- readRDS("../data/part3_counts.rds")
 96 | 
 97 | # Perform a log TPM-style normalization: scale each sample to one million total counts, then take log2(x + 1)
 98 | log2tpm <- sapply(1:dim(counts)[2], function(idx){
 99 |   log2((counts[,idx]/sum(counts[,idx]) * 1000000) + 1)
100 | })
101 | colnames(log2tpm) <- colnames(counts)
102 | 
103 | # Continue analysis here.
104 | ```
105 | 
106 | Student response
107 | <br>
108 | 
109 | ### Problem III.2 (1pt)
110 | **Run ComBat on the samples to remove the batch effect. Visualize the results using a principal component analysis similar to the one in the question above. Provide evidence that the batch effects are successfully adjusted. Do these results change the primary interpretation of the results?**
111 | 
112 | Student response
113 | <br>
114 | 
115 | ### Problem III.3 (1pt)
116 | **Run DESeq2 adjusting for the batch effect to identify differentially-expressed genes between the lung and adipose tissue. Report the number of statistically-significant genes as well as whether they are more highly expressed in either adipose tissue or lung tissue.**
117 | 
118 | Student response
119 | <br>
120 | 
121 | ### Problem III.4 (1pt)
122 | **Identify the top 5 most differentially expressed genes that are overexpressed in each of the tissues. Comment on the biological relevance of these. It may be useful to use data from the GTEx consortium when interpreting your result.**
123 | 
124 | ```
125 | GTEx link: https://www.gtexportal.org/home
126 | ```
127 | Student response
128 | <br>
129 | 
130 | ### Problem III.5 (1pt)
131 | **Visualize the differential gene expression values by making a volcano and an MA plot to summarize the differences between the two tissues. Be sure to use the `lfcShrink` function to get more robust estimates of the fold-changes for genes.**
132 | 
133 | Student response
134 | <br>
135 | 
136 | ### Problem III.6 (1pt; graduate students only)
137 | **Rerun differential gene expression analyses without accounting for the batch effect. Compare the number of differentially expressed genes and anecdotes of top differentially expressed genes. Are the numbers of differentially expressed genes before/after the batch effect consistent with what you would have expected? Comment on the biological relevance of the top genes.** 
138 | 
139 | Student response
140 | 
141 | -----
142 | 
143 | <br><br>
144 | 
145 | # Part IV: Gene ontology
146 | 
147 | While the previous question identified genes that were differentially expressed between tissues and specific anecdotes were used for interpretation, we often want to assess the differences between samples using a more holistic approach. Pathway enrichment analyses provide a statistically principled way of examining many differentially expressed genes in an effort to identify biological patterns that explain the results. These patterns are defined using prior biological knowledge.
148 | 
149 | ### Problem IV.1 (1.5pts)
150 | 
151 | **Run the up- and down-regulated genes computed in problem III.3 separately on DAVID (http://david.abcc.ncifcrf.gov/) to see whether these genes are enriched in specific biological processes, pathways, etc. For example, consider reporting the enrichments for the top 100 genes in the KEGG pathways. If you were to summarize the results in a paper, how would you describe the systematic biological features that are different between these tissues? Your analysis should comment on the stability of enriched pathways (with at least 2 different input gene list sizes) and attempt to interpret the results in terms of the differential physiological properties of the tissues.**
152 | 
153 | Student response
154 | 
155 | ### Problem IV.2 (0.5pts)
156 | 
157 | **Describe in at least 3 but no more than 7 sentences the methodological differences between how approaches like DAVID and approaches like GSEA work in identifying enriched pathways from RNA-seq data.**
158 | 
159 | Student response
160 | 
161 | ### Problem IV.3 (1pt; graduate students only)
162 | **Run Gene Set Enrichment analysis (http://www.broadinstitute.org/gsea/index.jsp) using the summary statistics from problem III.3. What are the gene sets or experiments that best capture the differential expression data between these two cell types? Comment on the biological relevance of the results and compare them to the results produced from the DAVID analysis.**
163 | 
164 | ```
165 | Hint: the fgsea package (https://bioconductor.org/packages/release/bioc/html/fgsea.html)
166 | is, in my hands, easier to use than the original java distribution of gsea
167 | ```
168 | 
169 | Student response
170 | 
171 | -----
172 | 
173 | <br><br>
174 | 
175 | # Part V: Python programming
176 | 
177 | ### Problem V.1 (2pts)
178 | 
179 | **RSeQC on RNA-seq generates many output files. One such file is called geneBodyCoverage.r which contains normalized reads mapped to each % of gene / transcript body. Suppose that we want to visualize all 12 samples from a recent RNA-seq library together to quickly perform quality control. These data files are present in the `part5` folder. Write a python program to extract the values and name from each file. The same script should then draw the gene body coverage for all the samples (3 rows x 4 cols) in one figure. We provide an example with 3 x 2 samples in one figure. Include your code and final figure in your report.**
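
As a starting point for the extraction step only, the sketch below pulls the sample name and coverage values out of the text of one file with a regular expression. It assumes each file starts with an assignment of the form `<name>.ds <- c(...)`, as in the provided data. The names `ASSIGNMENT` and `parse_gene_body_coverage` are hypothetical, and the `example` string is a shortened stand-in for a real file.

```python
import re

# Matches an R assignment such as `A.ds <- c(0.0,0.04,...)`.
ASSIGNMENT = re.compile(r"^\s*(\S+)\s*<-\s*c\(([^)]*)\)", re.MULTILINE)


def parse_gene_body_coverage(text):
    """Return (sample_name, coverage_values) from the text of one file."""
    match = ASSIGNMENT.search(text)
    if match is None:
        raise ValueError("no `<name> <- c(...)` assignment found")

    name = match.group(1)
    if name.endswith(".ds"):
        # Drop the `.ds` suffix so that `A.ds` becomes `A`.
        name = name[: -len(".ds")]

    values = [float(v) for v in match.group(2).split(",") if v.strip()]
    return name, values


example = "A.ds <- c(0.0,0.0425833309242,0.0977599775842)\n\nx=1:100\n"
name, values = parse_gene_body_coverage(example)
print(name, values)
```

The returned values can then be passed to a plotting library of your choice to draw one panel per sample in the 3 x 4 grid.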
180 | 
181 | Student response
182 | 
183 | -----
184 | 
185 | <br><br>
186 | 
187 | # Part VI: Batch effects and classification in the literature
188 | 
189 | In a recent manuscript (published September 2019), Zhou et al. describe a modified version of RNA-seq called SILVER-seq that enables profiling of extracellular RNAs (exRNAs). The manuscript reports impressive performance in classifying patients with breast cancer compared to healthy controls as well as whether the cancer was recurrent. About three weeks ago, the original findings were challenged by Hartl and Gao, who argued that a batch effect confounded the interpretation of the work. The authors then rebutted the challenge.
190 | 
191 | ### Problem VI.1 (1 pt)
192 | **In the main manuscript (see `1_original.pdf`), what bioinformatics methods were used in conjunction with the SILVER-seq protocol to predict patient status? Name the methods and describe their purpose (list at least 3).**
193 | 
194 | Student response
195 | 
196 | <br>
197 | 
198 | -----
199 | The next questions are for graduate students only. In 3-5 sentences each, answer the following questions related to the short letters that comment on the manuscript.
200 | 
201 | ### Problem VI.2 (0.5 pts; graduate students only)
202 | **Briefly summarize the Hartl and Gao response (see `2_Hartl_response.pdf`). Specifically, what evidence do Hartl and Gao offer that the interpretation of the original manuscript may be confounded by a batch effect? If you were the bioinformatician analyzing the original data, what steps, if any, could you take to eliminate the batch effect?**
203 | 
204 | Student response
205 | 
206 | ### Problem VI.3 (0.5 pts; graduate students only)
207 | **Summarize the response to Hartl (see `3_response_to_Hartl.pdf`). Specifically, how do the original authors argue that there is "a lack of between batch differences"? Do you find their rebuttal convincing?**
208 | 
209 | Student response
210 | 
211 | ### Problem VI.4 (0.5 pts; graduate students only)
212 | **Design a modified version of the study that utilizes both proper experimental setup and computational tools discussed in this lab/homework that would ameliorate the potential batch effect in assessing the efficacy of SILVER-seq. Comment specifically on batch design and analytical methods. Assume that you want to test the same number of samples (~130) but no more than 10 samples can be processed per batch.**
213 | 
214 | Student response
215 | 
216 | ### Problem VI.5 (0.0 pts; graduate students only)
217 | **After reading the primary manuscript and both responses, how do you interpret the efficacy of SILVER-seq as a tool for cancer diagnostics/prediction? What remaining questions do you think need to be answered before this technology could be confidently used with patients?**
218 | 
219 | Optional response (feedback will be given but students will not receive points.)
220 | 
221 | 
222 | **Note:** you can access the raw SILVER-seq data from this study here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131512
223 | 
224 | <br><br>
225 | 


--------------------------------------------------------------------------------
/HW2/data/part3_counts.rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW2/data/part3_counts.rds


--------------------------------------------------------------------------------
/HW2/data/part5/.Rhistory:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW2/data/part5/.Rhistory


--------------------------------------------------------------------------------
/HW2/data/part5/A.geneBodyCoverage.r:
--------------------------------------------------------------------------------
1 | A.ds <- c(0.0,0.0425833309242,0.0977599775842,0.156917206121,0.229372691429,0.285565609552,0.308986552749,0.338505124721,0.376918851724,0.405403361924,0.445907450359,0.465981591477,0.498704643049,0.528590140165,0.547986025711,0.551223862143,0.556732186896,0.589410762693,0.585725944721,0.610581231668,0.629381141827,0.65945121341,0.666874220284,0.678729238152,0.698058410036,0.70322427132,0.718019137926,0.748073642989,0.758456512699,0.765548175048,0.78169288143,0.795073418386,0.775183851736,0.769831192195,0.768165574416,0.777830160357,0.781759595093,0.795035613977,0.795389196389,0.798215631901,0.792320367904,0.777151904786,0.789758563255,0.799394239942,0.825959620444,0.829159652466,0.839504717768,0.852871911991,0.860779704814,0.861364561258,0.862389727875,0.86359724517,0.879332774421,0.883746995105,0.890144835362,0.891605864576,0.880753775437,0.871587318177,0.871282659118,0.880669271465,0.886789138126,0.892933466464,0.90244683477,0.903723289517,0.907545982392,0.910436907777,0.905769175175,0.937582697144,0.943233344378,0.946698007263,0.965471231957,0.983646257475,0.985783318471,0.995298910566,1.0,0.99896149065,0.988825461492,0.977381844544,0.946482299753,0.941454313372,0.925974519828,0.905017534574,0.92283897768,0.911557697311,0.897903634338,0.912676263056,0.928331735912,0.906567515339,0.847823911511,0.817177878639,0.785050802454,0.727345707977,0.699839664831,0.64017096488,0.532759744086,0.45725766818,0.370447626439,0.246529221696,0.131906253961,0.0216619262903)
2 | 
3 | 
4 | png("analysis/RSeQC/gene_body_cvg/11.1.FF/11.1.FF.geneBodyCoverage.curves.png")
5 | x=1:100
6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1)
7 | plot(x,A.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1])
8 | dev.off()
9 | 


--------------------------------------------------------------------------------
/HW2/data/part5/B.geneBodyCoverage.r:
--------------------------------------------------------------------------------
1 | B.ds <- c(0.05437399393153749,0.1237482015413325,0.21601447314064304,0.31159988034017577,0.39913673983959885,0.48560520805139673,0.5533127252524965,0.6127010356272882,0.6719041581788915,0.6988703542785509,0.7403809171070813,0.7739568939728486,0.8012934657188849,0.8395133833815297,0.8721776663485235,0.900668100685195,0.9151697317625607,0.9342868132024673,0.9509537172894201,0.9513668285873018,0.9543868146269889,0.9660678927050242,0.9772788786165045,0.9860966680437043,0.9927206940269805,0.9850852576247525,0.9898431601589767,1.0,0.9923360731634354,0.9827347967919771,0.9833758315645522,0.9767233151469394,0.9730053134660038,0.9491018390575364,0.9437171469679055,0.9271784498354677,0.9221641333922136,0.9137452100457272,0.9101554153193065,0.9036453510733771,0.8789156540691463,0.8569210387612359,0.8448410946024872,0.8432456302796336,0.8356101938774056,0.8329036026154218,0.8254675992535506,0.8051396743543355,0.7827889286172167,0.7757233009017223,0.764213165429707,0.750252852604738,0.7401387484152195,0.7253237225601504,0.706633997635294,0.6842974971153435,0.6743543355318452,0.6705793529822363,0.6561774384250488,0.6416473169133463,0.6256072023818003,0.6126867904101198,0.6033846635991966,0.5818743856750096,0.5746663057878317,0.5647516346386701,0.5459337027592985,0.5313893360304278,0.5185116597102523,0.5046368181882933,0.4995940113107024,0.49613242353879683,0.4799783472699041,0.473397056938133,0.467414065727432,0.4389948574766022,0.43559025057337,0.42788358808530036,0.4437954956623314,0.45071867120614256,0.3803188079602274,0.3624553056311344,0.35387968489579624,0.3365290103847633,0.31417826464764453,0.2925967606376159,0.2907876180572373,0.2802319121355005,0.273550905283551,0.26382142195757774,0.235744098918788,0.19189732047465063,0.16399094004188094,0.15240957848402398,0.09125486118035869,0.07495833273978261,0.05629709824926281,0.034615877719055825,0.016239547571902734,0.0)
2 | 
3 | 
4 | png("/liulab/jingxin/proj/cidc-rnaseq/results/pilot1/gene_body_coverage/5760-Frozen.geneBodyCoverage.curves.png")
5 | x=1:100
6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1)
7 | plot(x,B.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1])
8 | dev.off()
9 | 


--------------------------------------------------------------------------------
/HW2/data/part5/C.geneBodyCoverage.r:
--------------------------------------------------------------------------------
1 | C.ds <- c(0.05924531390685111,0.1360890562331178,0.22388475075714168,0.3211099287877548,0.42092166653024476,0.5126790537775231,0.5868052713432103,0.6506834738479168,0.7158058443152984,0.7144470819350086,0.766129164279283,0.804960301219612,0.8332487517393795,0.8648604403699762,0.8888761561758206,0.9075550462470329,0.9080461651796676,0.9263485307358599,0.9424408610951952,0.9584677089301793,0.9607268560202996,0.9686666120978964,0.9817303757059834,0.9912089711058362,1.0,0.9885569288696079,0.9932061880985512,0.9959400834902186,0.9882949987722026,0.9829418024064828,0.989506425472702,0.9832528443971515,0.9723008921993943,0.9434722108537285,0.9432757632806744,0.9258410411721372,0.9063599901776214,0.889121715642138,0.8915118277809609,0.8886633379716788,0.8599492510436277,0.8361790947041008,0.8276827371695179,0.8317099124171237,0.8177457640992061,0.8158467708930179,0.7963166080052386,0.7774740116231481,0.7598101006793812,0.748088728820496,0.7371695178849145,0.7213391176229844,0.7064254727019726,0.6948678071539658,0.6807890644184333,0.6603748874519113,0.6519767537038553,0.6401571580584432,0.6195792747810428,0.6182859949251044,0.600949496603094,0.5913399361545387,0.57595154293198,0.5575509535892609,0.5434558402226406,0.527576328067447,0.5073585986739789,0.4940001637063109,0.4834083653924859,0.4716869935336007,0.4716869935336007,0.48010149791274453,0.47104853892117543,0.4683473847916837,0.4491282638945731,0.41966112793648197,0.4151264631251535,0.4105263157894737,0.41453712040599167,0.4252762543996071,0.36125071621511007,0.3407055741998854,0.3360890562331178,0.3200294671359581,0.29658672341818776,0.27728574936563805,0.28190226733240564,0.26772530081034623,0.26369812556274047,0.2641565032331996,0.24420070393713678,0.1830236555619219,0.15928624048457068,0.15830400261930097,0.08545469427846443,0.08291724645985103,0.06479495784562495,0.04320209544077924,0.017909470410084307,0.0)
2 | 
3 | 
4 | png("/liulab/jingxin/proj/cidc-rnaseq/results/pilot1/gene_body_coverage/5760-Norm.geneBodyCoverage.curves.png")
5 | x=1:100
6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1)
7 | plot(x,C.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1])
8 | dev.off()
9 | 


--------------------------------------------------------------------------------
/HW2/data/part5/D.geneBodyCoverage.r:
--------------------------------------------------------------------------------
1 | D.ds <- c(0.03190949160116182,0.08926939248065321,0.16172079702638475,0.24599354091767717,0.33469420916864706,0.42810716389413606,0.49326671135214184,0.5520890459651047,0.6275668758759368,0.6508033229744277,0.7005260699124571,0.7356854142546666,0.7641622488980968,0.8049479007982451,0.8472975443300226,0.8745759957751914,0.8853817561391749,0.9210692015518047,0.93242337456584,0.9368309873458859,0.9331342798529442,0.9345154672678895,0.9359372778420978,0.9515568825787581,0.9664655820283143,0.9625048240001625,0.9825117299372372,0.9966689009404262,0.9900067028212785,0.9844413300022343,1.0,0.9917941218288546,0.968192066296996,0.9401620864054597,0.9378871894867263,0.9166615887717587,0.9014279040480978,0.9019356935388865,0.9108524769971361,0.9090853695691914,0.8915970995064286,0.848759978063494,0.8489224707005464,0.8464647695651291,0.8508520707655435,0.8628155911685251,0.8405540998923486,0.8060244145187171,0.7749070745231856,0.7755164219121321,0.762577945686836,0.7768163630085512,0.7653403205167266,0.7444397050758638,0.7158613125342758,0.6896390632299474,0.682286271403327,0.6768427680620722,0.6664635508703511,0.672983567932078,0.6602888306623607,0.6597607295919403,0.6457660512258039,0.6263684926776756,0.6100176710742794,0.5998212580992424,0.5578575345804643,0.5324680600410294,0.5306603294538216,0.5306400178741901,0.5285073020128775,0.5400849024028599,0.5366116222858651,0.5460768183941666,0.5160969268580018,0.464728941969817,0.4725285885483314,0.4519732699612049,0.5525359007169988,0.5550139134320476,0.40078808928970405,0.38486381085857047,0.39629923019113195,0.38646842564946277,0.3570978815022444,0.3338208112444905,0.3503138139053074,0.30725326508642575,0.31422013690004674,0.3145451221741515,0.3110921536367883,0.23400970893506387,0.20463916478784555,0.2172932788982999,0.10539678670810229,0.1039140413949993,0.08451648284687101,0.0613003473280117,0.02211931021875571,0.0)
2 | 
3 | 
4 | png("/liulab/jingxin/proj/cidc-rnaseq/results/pilot1/gene_body_coverage/5812-FFPE.geneBodyCoverage.curves.png")
5 | x=1:100
6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1)
7 | plot(x,D.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1])
8 | dev.off()
9 | 


--------------------------------------------------------------------------------
/HW2/data/part5/E.geneBodyCoverage.r:
--------------------------------------------------------------------------------
1 | E.ds <- c(0.0,0.062990547978,0.14598588179,0.227814668581,0.316621799474,0.396120483369,0.445220148361,0.482920555157,0.533937544867,0.574662000479,0.616511127064,0.630407992343,0.672194304858,0.700514477148,0.720384661402,0.729172648959,0.744017707586,0.779770878201,0.783016271835,0.809655419957,0.827916367552,0.851046901173,0.851779731993,0.873878320172,0.891920914094,0.891331658291,0.897891241924,0.911800071788,0.934090093324,0.93862766212,0.949814548935,0.960498923187,0.939351519502,0.935561737258,0.930635319454,0.941346614022,0.946267049533,0.96054379038,0.959673366834,0.960238693467,0.962455132807,0.943859176837,0.956990308686,0.96425879397,0.983728164633,0.987359416128,0.986877841589,0.997041756401,0.999009930605,1.0,0.991558985403,0.985361330462,0.99290201005,0.986351399856,0.998606125867,0.994131371141,0.985131012204,0.975969131371,0.971126465662,0.973570232113,0.984176836564,0.981646326872,0.980174682939,0.980049054798,0.978490667624,0.975397822446,0.968455372099,0.990344580043,0.986740248863,0.976058865757,0.981039124192,0.98432938502,0.975008973439,0.97118329744,0.963481095956,0.957256520699,0.946413615698,0.93068616894,0.899943168222,0.894212132089,0.873034816942,0.839291696578,0.848121560182,0.825598229241,0.803795764537,0.811593682699,0.807648360852,0.772771596076,0.724338956688,0.688238813113,0.654717037569,0.604172648959,0.56914034458,0.514653625269,0.426707944484,0.359790021536,0.286159966499,0.182923546303,0.0898630055037,0.00443586982532)
2 | 
3 | 
4 | png("analysis/RSeQC/gene_body_cvg/2.2.FF/2.2.FF.geneBodyCoverage.curves.png")
5 | x=1:100
6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1)
7 | plot(x,E.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1])
8 | dev.off()
9 | 


--------------------------------------------------------------------------------
/HW2/data/part5/F.geneBodyCoverage.r:
--------------------------------------------------------------------------------
1 | F.ds <- c(0.0,0.0517678183082,0.0879088850526,0.116247786023,0.178082191781,0.208394054212,0.201376589019,0.209447794991,0.221285563751,0.197654866265,0.208438894245,0.217160280699,0.22845996906,0.239759657422,0.249893504921,0.251799206331,0.258031970944,0.262560814295,0.269152299173,0.288747393673,0.300091922068,0.306773087012,0.314440732686,0.322556778692,0.339125170953,0.347779297357,0.348317377755,0.37353989642,0.393179830953,0.404770979531,0.412685245387,0.422213752438,0.420151110912,0.42160841199,0.414097706432,0.425531914894,0.435150102011,0.436674663139,0.443781808399,0.445777189875,0.435710602426,0.428625877183,0.443624868282,0.450709593525,0.470125327893,0.486088379705,0.498531488913,0.499719749793,0.514920521041,0.516153621954,0.520973925521,0.522834786898,0.537721677914,0.546129184136,0.549380086541,0.553841669843,0.536667937134,0.528843351344,0.518373203596,0.516400242136,0.520278905006,0.542272941282,0.561262695334,0.561912875816,0.572696903796,0.587583794812,0.591440037666,0.617985337309,0.633253368607,0.652467322826,0.686994148376,0.717642311055,0.724323475999,0.745913951976,0.76660762729,0.772661031769,0.790753985158,0.803286774432,0.781808398538,0.803667914714,0.811111360222,0.81185122077,0.846153846154,0.855794453288,0.865547160505,0.89471560209,0.942380557362,0.97829742394,0.958029728942,0.967356455844,1.0,0.980001345201,0.971481738896,0.955563527117,0.788556823532,0.719592852499,0.610160751519,0.453915655898,0.26785193821,0.0785821581508)
2 | 
3 | 
4 | png("analysis/RSeQC/gene_body_cvg/3.1.FF/3.1.FF.geneBodyCoverage.curves.png")
5 | x=1:100
6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1)
7 | plot(x,F.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1])
8 | dev.off()
9 | 


--------------------------------------------------------------------------------
/HW2/data/part5/G.geneBodyCoverage.r:
--------------------------------------------------------------------------------
1 | G.ds <- c(0.0,0.0670088790233,0.0910496273981,0.122482955446,0.194644839068,0.230141113049,0.222728714127,0.224770096718,0.228139368955,0.204792294276,0.216822578088,0.223997146028,0.242329950848,0.257590772158,0.264864436341,0.267599492627,0.276458696686,0.280719835104,0.290966386555,0.306821785318,0.312747740606,0.319724116062,0.326244648803,0.336312827018,0.350107023942,0.357856350087,0.365883145711,0.379934992865,0.400309180276,0.413746630728,0.414063738703,0.425856191533,0.429304740764,0.432614555256,0.43114793087,0.441335024576,0.456318376407,0.458439035992,0.467654986523,0.470172031077,0.463195655621,0.463017282385,0.473759315047,0.478773584906,0.496650547011,0.504895354368,0.512803234501,0.519938163945,0.531294593309,0.534624227049,0.53743856033,0.541521325511,0.56038925004,0.575491517362,0.569644839068,0.569803393055,0.558684794673,0.552719200888,0.546813064849,0.557138893293,0.55012287934,0.58076343745,0.592258601554,0.590613603932,0.611146345331,0.623751387347,0.631996194704,0.652865863326,0.665788013319,0.684338829872,0.7070516886,0.733629300777,0.738326462661,0.76115823688,0.785258443,0.790550182337,0.803630886317,0.821210559696,0.810944188996,0.821626763913,0.840990169653,0.833042651023,0.857519422863,0.878012525765,0.891549072459,0.918305057872,0.960936261297,0.98602742984,0.97449262724,0.979784366577,1.0,0.976553829079,0.953068019661,0.923240050737,0.792214999207,0.717674805771,0.609501347709,0.455545425717,0.268788647534,0.0652251466624)
2 | 
3 | 
4 | png("analysis/RSeQC/gene_body_cvg/3.2.FF/3.2.FF.geneBodyCoverage.curves.png")
5 | x=1:100
6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1)
7 | plot(x,G.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1])
8 | dev.off()
9 | 


--------------------------------------------------------------------------------
/HW2/data/part5/H.geneBodyCoverage.r:
--------------------------------------------------------------------------------
1 | H.ds <- c(0.0,0.053502292304,0.126145115693,0.199631490371,0.293835650509,0.37147967601,0.414057253712,0.458906409332,0.506311090109,0.536749405162,0.576843998972,0.596823936131,0.635376509895,0.658415615855,0.681643329105,0.700634632444,0.717169895789,0.750306745923,0.76030707754,0.791910198058,0.813187587568,0.841816516195,0.848413626151,0.875115029721,0.897250478772,0.899086809096,0.906610374645,0.935168834614,0.94834232845,0.953005695526,0.964713855796,0.97434733587,0.950667794165,0.939164822046,0.938252874707,0.946315318228,0.949478117409,0.969951335174,0.973412589848,0.973072682203,0.969901592592,0.949818025054,0.962616376916,0.966960562423,0.987728090465,0.987274189403,0.989607945548,1.0,0.996571907048,0.998824831497,0.993396672221,0.98651146981,0.996714916971,0.98645136419,0.99645791363,0.989726084181,0.981570373318,0.971186609297,0.960751030086,0.961990449424,0.968782384494,0.964898317872,0.961611162235,0.961706502185,0.962794621169,0.957528125285,0.95181601877,0.976782649787,0.974921448172,0.964112799595,0.977159864369,0.986397476393,0.983023271238,0.982148630835,0.978695666592,0.971853988941,0.958230739258,0.938000016581,0.904784407359,0.898011125758,0.878816706875,0.848345230101,0.860959119888,0.840199882276,0.82004584608,0.824408685055,0.833393438953,0.804476417871,0.744582203762,0.710160336923,0.674984040922,0.622748111854,0.585563459099,0.53281352335,0.414720488141,0.346836371776,0.271553046319,0.171222258147,0.0817166994139,0.0025949047015)
2 | 
3 | 
4 | png("analysis/RSeQC/gene_body_cvg/2.1.FF/2.1.FF.geneBodyCoverage.curves.png")
5 | x=1:100
6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1)
7 | plot(x,H.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1])
8 | dev.off()
9 | 


--------------------------------------------------------------------------------
/HW2/data/part5/J.geneBodyCoverage.r:
--------------------------------------------------------------------------------
1 | J.ds <- c(0.06359964400629835,0.14023413431916204,0.23799548161840214,0.33796125145478195,0.432340658588348,0.5244882590538783,0.603751625932772,0.665023618812898,0.72667898952557,0.7354008352159923,0.7798042034640925,0.8149654275347437,0.8384336277127404,0.8696241528034504,0.8874922982131854,0.903566783049223,0.9172314643663997,0.925734237009653,0.9452454302731567,0.9572807558020128,0.9648661600602451,0.9731772437872253,0.978585609639214,0.9874169918532211,0.9978229615937564,0.9894981858013281,0.9976449647429315,1.0,0.9931128910796193,0.9929485862942425,0.9901827890737318,0.9807352639145616,0.9606490039022386,0.9312521393852262,0.9306633805709591,0.9091805298829329,0.895680153351133,0.8734716231943589,0.8799068939549531,0.8732114739508455,0.8393099199014171,0.8119942493325119,0.810542890395016,0.8097213664681318,0.7991784760731157,0.80639419456425,0.7896898747176011,0.767481344560827,0.745491887451222,0.7326487300609297,0.719737112343397,0.7110563428493188,0.6969672075032519,0.6858492503594167,0.6689669336619429,0.6555760936537277,0.6492092832203737,0.6332443349079209,0.6181693708495927,0.6059971246662559,0.5955226945984802,0.5812555624015883,0.5666050523721503,0.5449715889641953,0.5306907647018553,0.5155884165126309,0.4963510645580886,0.48072841788183746,0.4714451975080441,0.45861573218319984,0.4599438625316629,0.47095228315191345,0.4647360854384884,0.46799479701512975,0.4470048606832341,0.419921955226946,0.4247141781337715,0.41989457109604983,0.41942904087081534,0.4271787499144246,0.36655028411035806,0.34831245293352503,0.3469021701923735,0.34059012802081196,0.31420551790237555,0.2895871842267406,0.3102622030533306,0.2872321489696721,0.2789758335044841,0.29233928938180326,0.27789416033408637,0.19863079345519272,0.17597042513863217,0.18688300130074623,0.08645170123913193,0.0844526596837133,0.06699527623742041,0.044923666735126995,0.013609913055384405,0.0)
2 | 
3 | 
4 | png("/liulab/jingxin/proj/cidc-rnaseq/results/pilot1/gene_body_coverage/5869-Norm.geneBodyCoverage.curves.png")
5 | x=1:100
6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1)
7 | plot(x,J.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1])
8 | dev.off()
9 | 


--------------------------------------------------------------------------------
/HW2/data/part5/K.geneBodyCoverage.r:
--------------------------------------------------------------------------------
1 | K.ds <- c(0.24300350877192983,0.3102035087719298,0.34040701754385966,0.3138526315789474,0.3466105263157895,0.46616140350877194,0.5223859649122807,0.5820350877192982,0.6295017543859649,0.476519298245614,0.47943859649122805,0.4977122807017544,0.5151438596491228,0.5557052631578947,0.5570526315789474,0.566540350877193,0.5841122807017544,0.6104701754385965,0.6334596491228071,0.6302315789473685,0.6406456140350877,0.6670315789473684,0.6811508771929825,0.7008561403508772,0.7120842105263158,0.6575157894736842,0.7679157894736842,0.8416,0.8528842105263158,0.8762947368421052,0.9393122807017544,0.9460771929824562,0.8846877192982456,0.7545263157894737,0.769740350877193,0.7541614035087719,0.7481263157894736,0.8174035087719298,0.840701754385965,0.8485894736842106,0.6943719298245614,0.5897824561403509,0.5955649122807017,0.6075228070175439,0.6514526315789474,0.8120982456140351,0.8025543859649122,0.7217122807017544,0.6504140350877193,0.6294175438596491,0.6284350877192982,0.6834245614035088,0.7216280701754386,0.7188491228070175,0.6484210526315789,0.5955649122807017,0.5813894736842106,0.545740350877193,0.5023438596491228,0.5965473684210526,0.5333333333333333,0.5995228070175439,0.5714807017543859,0.5334736842105263,0.440140350877193,0.41324912280701753,0.38669473684210526,0.4225122807017544,0.4583859649122807,0.5468912280701754,0.6273122807017544,0.8967859649122807,0.9177824561403509,0.9382456140350878,0.802161403508772,0.683059649122807,0.9293473684210526,0.7798456140350877,0.7568561403508772,0.7861052631578948,0.6334596491228071,0.5705824561403509,0.7172771929824562,0.7948350877192982,0.7307228070175439,0.6842666666666667,1.0,0.7305824561403509,0.6729263157894737,0.9635649122807017,0.8319438596491228,0.5459368421052632,0.4895438596491228,0.3022315789473684,0.16067368421052633,0.19508771929824562,0.17883508771929824,0.12654035087719298,0.01552280701754386,0.0)
2 | 
3 | 
4 | png("/liulab/jingxin/proj/cidc-rnaseq/results/pilot1/gene_body_coverage/5903-FFPE.geneBodyCoverage.curves.png")
5 | x=1:100
6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1)
7 | plot(x,K.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1])
8 | dev.off()
9 | 


--------------------------------------------------------------------------------
/HW2/data/part5/L.geneBodyCoverage.r:
--------------------------------------------------------------------------------
1 | L.ds <- c(0.00958043229036,0.0898729133462,0.178319771831,0.260733859016,0.356027877974,0.425271929089,0.463688029528,0.507532226044,0.54897519783,0.577192209826,0.615262282247,0.638026731538,0.672395352739,0.701321896933,0.724463831334,0.734865643263,0.751258283701,0.783274613427,0.791764533177,0.809572742779,0.826321896933,0.84207141459,0.850404048877,0.861151888824,0.876978301597,0.881724827336,0.89009590918,0.916177334116,0.924188407013,0.929501160417,0.947907054777,0.955519671169,0.943034700668,0.939046640383,0.943995889607,0.948476777675,0.955858708721,0.969175544557,0.976959427341,0.984425243967,0.977742359422,0.960905824455,0.966711405643,0.972149987417,0.989360512261,0.990223834689,0.991569499203,0.995050750776,0.996183206107,0.99681584319,0.992464278724,0.990611805497,0.9963544725,0.9971933283,1.0,0.996148253782,0.986784525907,0.97556133434,0.968361155384,0.9663304253,0.967784442021,0.964352123703,0.963268601627,0.960203282722,0.959228112854,0.958130609848,0.951818219948,0.976742722926,0.974484802729,0.972129016022,0.977438274194,0.986277717194,0.981076811229,0.982751027598,0.982233733188,0.976679808741,0.966966557615,0.950643821827,0.922479238319,0.907792271342,0.890826412773,0.862595419847,0.869683751363,0.854395604396,0.834116265414,0.837048765484,0.839163381148,0.815661437799,0.7626247798,0.727260017336,0.691741464642,0.637100494925,0.608516483516,0.548730531555,0.452667561446,0.379204764701,0.299625311076,0.190025305483,0.0853361015575,0.0)
2 | 
3 | 
4 | png("analysis/RSeQC/gene_body_cvg/8.1.FF/8.1.FF.geneBodyCoverage.curves.png")
5 | x=1:100
6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1)
7 | plot(x,L.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1])
8 | dev.off()
9 | 


--------------------------------------------------------------------------------
/HW2/data/part5/M.geneBodyCoverage.r:
--------------------------------------------------------------------------------
1 | M.ds <- c(0.0,0.0726672876088,0.156201062425,0.238977424318,0.323459671174,0.397884020293,0.441138439388,0.487947922744,0.536934115311,0.573897230666,0.609485402742,0.626875194044,0.667819844901,0.691446351624,0.715697212243,0.730285940437,0.737962422766,0.769732824297,0.782329078856,0.805419937701,0.820059842444,0.842423994296,0.84264917111,0.865808265354,0.879820404431,0.88796088747,0.891809363944,0.918025404039,0.929434362664,0.936431902778,0.947581566889,0.967212891031,0.965165829077,0.955302402227,0.949000863178,0.948492509459,0.939256848275,0.950781807078,0.962207824553,0.960314292245,0.957649699935,0.939560495798,0.949614981764,0.962576295705,0.982231502236,0.9807269117,0.978133966558,0.990211632088,0.988556923675,0.991893634661,0.992964930417,0.989003865535,0.99911293982,0.99776187893,1.0,0.998004114595,0.989635042971,0.975407962389,0.971010190957,0.971460544587,0.97446972566,0.975397727079,0.980938441435,0.977461847883,0.981405853915,0.979481615678,0.973664547958,0.98028338161,0.98070644108,0.97536360938,0.98429221127,0.988396570489,0.985721742869,0.984711858971,0.982173502148,0.97221113397,0.958543583655,0.94845156822,0.917035990761,0.915401752967,0.900175023797,0.848357062193,0.881785583907,0.858977901966,0.837766928349,0.831748566204,0.823962907237,0.801598755386,0.758262453813,0.721469926954,0.690108937814,0.639109800991,0.594739733131,0.538053175846,0.437999611058,0.365137852564,0.286994674227,0.193150530701,0.0997908585037,0.0150663759839)
2 | 
3 | 
4 | png("analysis/RSeQC/gene_body_cvg/9.1.FF/9.1.FF.geneBodyCoverage.curves.png")
5 | x=1:100
6 | icolor = colorRampPalette(c("#7fc97f","#beaed4","#fdc086","#ffff99","#386cb0","#f0027f"))(1)
7 | plot(x,M.ds,type='l',xlab="Gene body percentile (5'->3')", ylab="Coverage",lwd=0.8,col=icolor[1])
8 | dev.off()
9 | 


--------------------------------------------------------------------------------
/HW2/data/part5_example.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW2/data/part5_example.pdf


--------------------------------------------------------------------------------
/HW2/papers/part3_4-manuscript.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW2/papers/part3_4-manuscript.pdf


--------------------------------------------------------------------------------
/HW2/papers/part6/1_original.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW2/papers/part6/1_original.pdf


--------------------------------------------------------------------------------
/HW2/papers/part6/2_Hartl_response.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW2/papers/part6/2_Hartl_response.pdf


--------------------------------------------------------------------------------
/HW2/papers/part6/3_response_to_Hartl.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW2/papers/part6/3_response_to_Hartl.pdf


--------------------------------------------------------------------------------
/HW3/Homework3_release.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Homework 3"
  3 | author: ""
  4 | date: "February 23, 2020"
  5 | output: html_document
  6 | ---
  7 | 
  8 | ```{r setup, include=FALSE}
  9 | knitr::opts_chunk$set(echo = TRUE, cache = TRUE, message = FALSE)
 10 | ```
 11 | 
 12 | Spring 2020 STAT115/215 BIO/BST282
 13 | Due: 3/8/2020 midnight
 14 | 
 15 | # HOMEWORK 3: Classification and scRNA-seq
 16 | 
 17 | ## Part I: Sample classification
 18 | 
 19 | We provide z-score-normalized expression data for 50 breast tumor samples and 50 normal breast samples (your training and cross-validation data), plus 20 samples without a diagnosis (your testing data). The goal is to train machine learning models on the 100 samples with known diagnoses and use them to predict the diagnosis of the 20 unknown samples.
 20 | 
 21 | You will need the following libraries in R: `ggplot2` and `ggfortify` for plotting, `MASS` and `caret` for machine learning, and `pROC` for evaluating testing performance. The [YouTube video on caret](https://youtu.be/z8PRU46I3NY) and the [package documentation](http://topepo.github.io/caret/index.html) might be helpful.
 22 | 
 23 | ```{r prepare}
 24 | library(ggplot2)
 25 | library(ggfortify)
 26 | library(pROC)
 27 | library(caret)
 28 | library(e1071) # KNN
 29 | library(kernlab) #SVM
 30 | 
 31 | #### read in data for question 1
 32 | dataset <- read.table(file = "q1_data/BRCA_zscore_data.txt", sep = "\t", header = TRUE, row.names = 1)
 33 | phenotype <- read.table(file = "q1_data/BRCA_phenotype.txt",sep = "\t", header = TRUE, row.names = 1)
 34 | phenotype <- as.character(phenotype[rownames(dataset),'phenotype']) # the labels
 35 | ```
 36 | 
 37 | 
 38 | ### 1. Run PCA for dimension reduction on the 100 samples with known labels, and draw these 100 samples in a 2D plot. Do cancer and normal separate from the first two PCs? Would this be sufficient to classify the unknown samples?
 39 | 
 40 | 
 41 | ```{r}
 42 | # your code here
 43 | ```
 44 | 
 45 | 
 46 | ### 2. Draw a plot showing the cumulative % variance captured from the top 100 PCs. How many PCs are needed to capture 90% of the variance? 
 47 | 
 48 | 
 49 | ```{r}
 50 | # your code here
 51 | ```
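
The PCA questions above (Q1 and Q2) can be sketched in base R as follows. This is only an illustration on simulated data, not a reference solution: the matrix `toy`, the `labels` vector, and the 2-unit shift are assumptions made up for the example, and `toy` stands in for `dataset`, whose rows are samples.

```r
# Simulated stand-in for `dataset`: 100 samples (rows) x 500 genes (columns).
set.seed(115)
toy <- matrix(rnorm(100 * 500), nrow = 100,
              dimnames = list(paste0("s", 1:100), paste0("g", 1:500)))
toy[1:50, 1:20] <- toy[1:50, 1:20] + 2          # shift 20 genes in the first 50 samples
labels <- rep(c("tumor", "normal"), each = 50)  # hypothetical labels

# Samples are rows, so prcomp() returns one score vector per sample in pca$x.
pca <- prcomp(toy, center = TRUE, scale. = FALSE)

# Proportion of variance per PC, then the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
n_pc_90 <- which(cum_var >= 0.9)[1]             # smallest number of PCs reaching 90%

pdf(NULL)  # null device so the sketch also runs non-interactively; omit when knitting
plot(pca$x[, 1], pca$x[, 2], col = ifelse(labels == "tumor", "red", "blue"),
     xlab = "PC1", ylab = "PC2")
plot(cum_var, type = "b", xlab = "Number of PCs", ylab = "Cumulative variance")
abline(h = 0.9, lty = 2)
invisible(dev.off())
```

With 100 samples, `prcomp()` returns at most 100 PCs, so "top 100 PCs" covers all of the variance in the training data.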
 52 | 
 53 | 
 54 | ### 3. Apply machine learning methods (KNN, logistic regression, Ridge regression, LASSO, ElasticNet, random forest, and support vector machines) to the top 25 PCs of the training data, using 5-fold cross-validation, to classify the samples. `caret` and `MASS` already implement all of these machine learning methods, including cross-validation. To get consistent results across runs, call `set.seed(115)` right before each `train` command.
 55 | 
 56 | ```{r}
 57 | # your code here
 58 | ```
 59 | 
 60 | ### 4. Summarize the performance of each machine learning method, in terms of accuracy and kappa. 
 61 | 
 62 | ```{r}
 63 | # your code here
 64 | ```
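
For reference, accuracy and Cohen's kappa can be computed by hand from a confusion table. The sketch below is a base R illustration with made-up labels; the helper name `accuracy_kappa` is hypothetical and is not part of `caret`, which reports both metrics itself.

```r
# Cohen's kappa: agreement beyond what the marginal class frequencies predict.
accuracy_kappa <- function(truth, pred) {
  lev <- union(truth, pred)
  tab <- table(factor(truth, lev), factor(pred, lev))
  n <- sum(tab)
  po <- sum(diag(tab)) / n                       # observed agreement (accuracy)
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  c(accuracy = po, kappa = (po - pe) / (1 - pe))
}

# Hypothetical labels: 6 of 8 predictions agree with the truth.
truth <- c("tumor", "tumor", "tumor", "tumor", "normal", "normal", "normal", "normal")
pred  <- c("tumor", "tumor", "tumor", "normal", "normal", "normal", "normal", "tumor")
accuracy_kappa(truth, pred)   # accuracy 0.75, kappa 0.5
```

Because the two classes are balanced here, chance agreement is 0.5, so an accuracy of 0.75 corresponds to a kappa of 0.5.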
 65 | 
 66 | 
 67 | ### 5. For Graduate students: Compare the performance difference between logistic regression, Ridge, LASSO, and ElasticNet. In LASSO, how many PCs have non-zero coefficients? In ElasticNet, what is the lambda for Ridge and LASSO, respectively? 
 68 | 
 69 | ```{r}
 70 | # your code here
 71 | ```
 72 | 
 73 | 
 74 | ### 6. Use the PCA projections in Q1 to obtain the first 25 PCs of the 20 unknown samples. Use one method that performs well in Q4 to make predictions for the unknown samples (`q1_data/unknown_samples.txt`). Caret has already used the hyper-parameters learned from cross-validation to train the parameters of each method on all 100 training samples. You just need to call this method to make the predictions. 
 75 | 
 76 | ```{r}
 77 | # your code here
 78 | ```
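
Projecting new samples onto existing PCs means reusing the training centering and loadings rather than running PCA again. A minimal sketch on simulated matrices (`train` and `unknown` are placeholders for the real data, and the gene names are made up):

```r
set.seed(115)
genes <- paste0("g", 1:200)
train   <- matrix(rnorm(100 * 200), nrow = 100, dimnames = list(NULL, genes))
unknown <- matrix(rnorm(20 * 200),  nrow = 20,  dimnames = list(NULL, genes))

pca <- prcomp(train, center = TRUE, scale. = FALSE)

# predict() applies the training centering and rotation to the new samples.
unknown_pcs <- predict(pca, newdata = unknown)[, 1:25]

# Equivalent by hand: subtract the training gene means, multiply by the loadings.
manual <- scale(unknown, center = pca$center, scale = FALSE) %*% pca$rotation[, 1:25]
stopifnot(isTRUE(all.equal(unname(unknown_pcs), unname(manual))))
```

The columns of the new matrix must carry the same gene names as the training matrix, since `predict()` matches them against the rows of `pca$rotation`.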
 79 | 
 80 | 
 81 | ### 7. For Graduate students: Can you find out the top 3 genes that are most important in this prediction method in Q6? Do they have some known cancer relevance? 
 82 | 
 83 | ```{r}
 84 | # your code here
 85 | ```
 86 | 
 87 | 
 88 | ### 8. Suppose a pathologist later made a diagnosis for each of the 20 unknown samples (load the `q1_data/diagnosis.txt` file). Based on this gold standard, draw an ROC curve of your predictions in Q6. What is the prediction AUC? 
 89 | 
 90 | ```{r}
 91 | # your code here
 92 | ```
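
As a cross-check on whatever `pROC` reports, the AUC can also be computed from ranks: it equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The sketch below uses base R only; `auc_rank` is a hypothetical helper and the scores and labels are invented for illustration.

```r
# AUC via the Mann-Whitney rank statistic.
auc_rank <- function(scores, is_positive) {
  r <- rank(scores)                       # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities and gold-standard labels.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)
truth  <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
auc_rank(scores, truth)   # 8 of 9 positive/negative pairs are ordered correctly
```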
 93 | 
 94 | 
 95 | ## Part II. Single cell RNA-seq 
 96 | 
 97 | For this exercise, we will analyze a single-cell RNA-seq dataset of human peripheral blood mononuclear cells (PBMC) from a healthy donor, generated with the droplet-based 10X Genomics platform (Next GEM). The data, already processed by CellRanger into an expression matrix, can be found at the link below. 
 98 | 
 99 | https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_v3_nextgem
100 | 
101 | Please provide code and text answer for each question.
102 | 
103 | ### 1. Load data: Read the 10X data and create a Seurat (Butler et al., Nature Biotechnology 2018) object. Please report the number of cells, the number of genes, and the dropout rate.
104 | 
105 | ```{r}
106 | # your code here
107 | ```
108 | 
109 | 
110 | ### 2. QC genes: We want to filter genes that are detected in very few cells. Let’s keep all genes expressed in >= 10 cells. How do the above summary statistics change after filtering?
111 | 
112 | ```{r}
113 | # your code here
114 | ```
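
The summary statistics in Q1 and the gene filter in Q2 can be illustrated on a plain count matrix. The sketch below assumes that "dropout rate" means the fraction of zero entries in the genes-by-cells matrix, and uses a small simulated matrix; the real 10X matrix is sparse and far larger, so the threshold of 10 cells is much stricter here than it would be there.

```r
set.seed(115)
# Toy count matrix: 200 genes (rows) x 50 cells (columns), mostly zeros.
counts <- matrix(rpois(200 * 50, lambda = 0.3), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("cell", 1:50)))

dropout_rate <- function(m) mean(m == 0)   # fraction of zero entries

# Keep genes detected (count > 0) in at least `min_cells` cells.
min_cells <- 10
keep <- rowSums(counts > 0) >= min_cells
filtered <- counts[keep, , drop = FALSE]

c(genes_before = nrow(counts), genes_after = nrow(filtered),
  dropout_before = dropout_rate(counts), dropout_after = dropout_rate(filtered))
```

Dropping rarely detected genes removes the rows with the most zeros, so the dropout rate cannot go up after filtering.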
115 | 
116 | 
117 | ### 3. QC cells: Next we will filter out cells with a high proportion of mitochondrial reads (potential dead cells) or an outlier number of genes (potential poor reactions or multiplets). What proportion of the counts from your filtered dataset map to mitochondrial genes? Remove cells with a high mitochondrial rate (> 5%). Outlier cells with extremely high or low gene coverage should also be removed; the cutoff depends on the scRNA-seq technology and the distribution of each dataset. What is the distribution of the number of genes and UMIs in your dataset? Let’s filter out cells whose number of covered genes is more than 1 standard deviation from the average. Keep the remaining cells for downstream analysis.
118 | 
119 | 
120 | ```{r}
121 | # your code here
122 | ```
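
The cell filter described in question 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample standard deviation is used, the thresholds are passed in as parameters, and the per-cell values are made up.

```python
from statistics import mean, stdev


def keep_cells(n_genes, pct_mito, max_mito=5.0, n_sd=1.0):
    """Return indices of cells that pass both QC filters.

    n_genes:  number of detected genes per cell
    pct_mito: percentage of counts mapping to mitochondrial genes per cell
    """
    mu = mean(n_genes)
    sd = stdev(n_genes)  # sample standard deviation (an assumption)
    low, high = mu - n_sd * sd, mu + n_sd * sd
    return [i for i, (genes, mito) in enumerate(zip(n_genes, pct_mito))
            if low <= genes <= high and mito <= max_mito]


# Toy values: cell 3 has high mitochondrial content,
# cells 4 and 5 are gene-count outliers.
toy_genes = [1000, 1100, 900, 1050, 3000, 200]
toy_mito = [2.0, 3.0, 1.0, 8.0, 2.0, 4.0]
kept = keep_cells(toy_genes, toy_mito)
```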
123 | 
124 | 
125 | ### 4. Dimension reduction: Use the global-scaling normalization method in Seurat with a scale factor of 10000, so that all cells are normalized to the same sequencing depth of 10K. Use the Seurat function `FindVariableFeatures` (named `FindVariableGenes` in Seurat v2) to select 2000 genes (the default) showing expression variability, then perform PCA on these genes. Provide summary plots, statistics, and tables to show:
126 | - How many PCs are statistically significant.
127 | - The top 5 genes with the most positive and negative coefficients in each of the significant PCs.
128 | - How much variability is explained by each of the significant PCs.
129 | 
130 | ```{r}
131 | # your code here
132 | ```
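
To make the global-scaling normalization in question 4 concrete, the Python sketch below applies it to a single cell: each count is divided by the cell's total, multiplied by the scale factor, and passed through log(1 + x). The function name and toy counts are hypothetical. This mirrors what Seurat's log-normalization is understood to do, but treat it as an illustration and not as the library's code.

```python
import math


def log_normalize(cell_counts, scale_factor=10000):
    """Normalize one cell's raw counts to a common depth, then log-transform."""
    total = sum(cell_counts)
    return [math.log1p(count / total * scale_factor) for count in cell_counts]


# Toy cell with three genes and four UMIs in total.
normalized = log_normalize([1, 0, 3])
```

Undoing the log with `math.expm1` on each value and summing gives back the scale factor, which confirms that every cell ends up at the same depth.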
133 | 
134 | 
135 | ### 5. For GRADUATE students: Sometimes scRNA-seq data might have significant PCs that are heavily weighted by cell cycle genes, which need to be removed before downstream analyses. Check the top PCs in this data to see whether cell cycle components need to be removed. Provide plots and other quantitative arguments to support your case. 
136 | 
137 | 
138 | ```{r}
139 | # your code here
140 | ```
141 | 
142 | 
143 | ### 6. Visualization: Use Seurat to run UMAP on the top 20 PCs (regardless of how many PCs are statistically significant) from Q4. Visualize the cells in their UMAP coordinates and comment on the number of cell clusters that appear in this data. Describe the difference between PCA and UMAP on 2D plots.
144 | 
145 | ```{r}
146 | # your code here
147 | ```
148 | 
149 | 
150 | ### 7. For GRADUATE students: Use Seurat to run tSNE on the top 20 PCs (regardless of how many PCs are statistically significant) from Q4. Comment on the difference between tSNE and UMAP runtime and results.
151 | 
152 | ```{r}
153 | # your code here
154 | ```
155 | 
156 | 
157 | ### 8. For GRADUATE students: Try different `resolution` in clustering and draw the resulting clusters in different colors on UMAP. How does resolution influence the number of clusters and the number of cells assigned to each cluster?
158 | 
159 | 
160 | ```{r}
161 | # your code here
162 | ```
163 | 
164 | 
165 | ### 9. Clustering: Use resolution = 0.6 to cluster the cells. How many clusters do you get and how many cells are assigned to each cluster? Use Seurat to calculate differential expression between clusters (one vs the rest) and identify putative biomarkers for each cell subpopulation. Visualize the gene expression values of these potential markers on your UMAP plots.
166 | 
167 | ```{r}
168 | # your code here
169 | ```
170 | 
171 | 
172 | ### 10. Annotation: For GRADUATE students: Based on the expression characteristics of your cell clusters, provide putative biological annotation (e.g. MS4A1, CD79A genes are high in B-cells) for the clusters. This paper (Newman et al, Nat Methods 2015, https://www.nature.com/articles/nmeth.3337) may serve as a good resource, as may this PBMC tutorial (https://satijalab.org/seurat/pbmc3k_tutorial.html).
173 | 
174 | 
175 | ```{r}
176 | # your code here
177 | ```
178 | 
179 | ## Rules for submitting the homework:
180 | 
181 | Please submit your solution directly on the canvas website. Please provide both your code in this Rmd document and an html file for your final write-up. Please pay attention to the clarity and cleanness of your homework.
182 | 
183 | The teaching fellows will grade your homework and give the grades with feedback through canvas within one week after the due date. Some of the questions might not have a unique or optimal solution. TFs will grade those according to your creativity and effort on exploration, especially in the graduate-level questions.
184 | 


--------------------------------------------------------------------------------
/HW3/q1_data/BRCA_phenotype.txt:
--------------------------------------------------------------------------------
  1 | sample	phenotype
  2 | TCGA-A7-A13G-01	Tumor
  3 | TCGA-A7-A13G-11	Normal
  4 | TCGA-AC-A23H-01	Tumor
  5 | TCGA-AC-A23H-11	Normal
  6 | TCGA-AC-A2FB-01	Tumor
  7 | TCGA-AC-A2FB-11	Normal
  8 | TCGA-AC-A2FF-01	Tumor
  9 | TCGA-AC-A2FF-11	Normal
 10 | TCGA-AC-A2FM-01	Tumor
 11 | TCGA-AC-A2FM-11	Normal
 12 | TCGA-AC-A6IX-01	Tumor
 13 | TCGA-AC-A6IX-06	Tumor
 14 | TCGA-A7-A0CE-01	Tumor
 15 | TCGA-A7-A0CE-11	Normal
 16 | TCGA-A7-A0CH-01	Tumor
 17 | TCGA-A7-A0CH-11	Normal
 18 | TCGA-A7-A0DB-01	Tumor
 19 | TCGA-A7-A0DB-11	Normal
 20 | TCGA-BH-A0AY-01	Tumor
 21 | TCGA-BH-A0AY-11	Normal
 22 | TCGA-BH-A0BV-01	Tumor
 23 | TCGA-BH-A0BV-11	Normal
 24 | TCGA-BH-A0DZ-01	Tumor
 25 | TCGA-BH-A0DZ-11	Normal
 26 | TCGA-A7-A0D9-01	Tumor
 27 | TCGA-A7-A0D9-11	Normal
 28 | TCGA-BH-A0B3-01	Tumor
 29 | TCGA-BH-A0B3-11	Normal
 30 | TCGA-BH-A0B8-01	Tumor
 31 | TCGA-BH-A0B8-11	Normal
 32 | TCGA-BH-A0BA-01	Tumor
 33 | TCGA-BH-A0BA-11	Normal
 34 | TCGA-BH-A0BJ-01	Tumor
 35 | TCGA-BH-A0BJ-11	Normal
 36 | TCGA-BH-A0BM-01	Tumor
 37 | TCGA-BH-A0BM-11	Normal
 38 | TCGA-BH-A0C0-01	Tumor
 39 | TCGA-BH-A0C0-11	Normal
 40 | TCGA-BH-A0DK-01	Tumor
 41 | TCGA-BH-A0DK-11	Normal
 42 | TCGA-BH-A0DP-01	Tumor
 43 | TCGA-BH-A0DP-11	Normal
 44 | TCGA-BH-A0E0-01	Tumor
 45 | TCGA-BH-A0E0-11	Normal
 46 | TCGA-BH-A0E1-01	Tumor
 47 | TCGA-BH-A0E1-11	Normal
 48 | TCGA-BH-A0H7-01	Tumor
 49 | TCGA-BH-A0H7-11	Normal
 50 | TCGA-BH-A0H9-01	Tumor
 51 | TCGA-BH-A0H9-11	Normal
 52 | TCGA-BH-A0HK-01	Tumor
 53 | TCGA-BH-A0HK-11	Normal
 54 | TCGA-BH-A0BC-01	Tumor
 55 | TCGA-BH-A0BC-11	Normal
 56 | TCGA-BH-A0DH-01	Tumor
 57 | TCGA-BH-A0DH-11	Normal
 58 | TCGA-BH-A0DQ-01	Tumor
 59 | TCGA-BH-A0DQ-11	Normal
 60 | TCGA-BH-A0B7-01	Tumor
 61 | TCGA-BH-A0B7-11	Normal
 62 | TCGA-BH-A0BQ-01	Tumor
 63 | TCGA-BH-A0BQ-11	Normal
 64 | TCGA-BH-A0BW-01	Tumor
 65 | TCGA-BH-A0BW-11	Normal
 66 | TCGA-BH-A0DL-01	Tumor
 67 | TCGA-BH-A0DL-11	Normal
 68 | TCGA-BH-A0H5-01	Tumor
 69 | TCGA-BH-A0H5-11	Normal
 70 | TCGA-BH-A0DO-01	Tumor
 71 | TCGA-BH-A0DO-11	Normal
 72 | TCGA-BH-A0DT-01	Tumor
 73 | TCGA-BH-A0DT-11	Normal
 74 | TCGA-BH-A18J-01	Tumor
 75 | TCGA-BH-A18J-11	Normal
 76 | TCGA-A7-A13E-01	Tumor
 77 | TCGA-A7-A13E-11	Normal
 78 | TCGA-A7-A13F-01	Tumor
 79 | TCGA-A7-A13F-11	Normal
 80 | TCGA-BH-A0AU-01	Tumor
 81 | TCGA-BH-A0AU-11	Normal
 82 | TCGA-BH-A0AZ-01	Tumor
 83 | TCGA-BH-A0AZ-11	Normal
 84 | TCGA-BH-A0B5-01	Tumor
 85 | TCGA-BH-A0B5-11	Normal
 86 | TCGA-BH-A0BS-01	Tumor
 87 | TCGA-BH-A0BS-11	Normal
 88 | TCGA-BH-A0BT-01	Tumor
 89 | TCGA-BH-A0BT-11	Normal
 90 | TCGA-BH-A0BZ-01	Tumor
 91 | TCGA-BH-A0BZ-11	Normal
 92 | TCGA-BH-A0C3-01	Tumor
 93 | TCGA-BH-A0C3-11	Normal
 94 | TCGA-BH-A0DD-01	Tumor
 95 | TCGA-BH-A0DD-11	Normal
 96 | TCGA-BH-A0DG-01	Tumor
 97 | TCGA-BH-A0DG-11	Normal
 98 | TCGA-BH-A0DV-01	Tumor
 99 | TCGA-BH-A0DV-11	Normal
100 | TCGA-BH-A0HA-01	Tumor
101 | TCGA-BH-A0HA-11	Normal
102 | 


--------------------------------------------------------------------------------
/HW3/q1_data/diagnosis.txt:
--------------------------------------------------------------------------------
 1 | sample	phenotype
 2 | Test1	Normal
 3 | Test2	Tumor
 4 | Test3	Tumor
 5 | Test4	Normal
 6 | Test5	Tumor
 7 | Test6	Normal
 8 | Test7	Tumor
 9 | Test8	Normal
10 | Test9	Tumor
11 | Test10	Normal
12 | Test11	Tumor
13 | Test12	Normal
14 | Test13	Tumor
15 | Test14	Normal
16 | Test15	Tumor
17 | Test16	Normal
18 | Test17	Tumor
19 | Test18	Normal
20 | Test19	Tumor
21 | Test20	Tumor
22 | 


--------------------------------------------------------------------------------
/HW4/README.md:
--------------------------------------------------------------------------------
1 | # Homework-4
2 | 
3 | - Due: March 29, 2020 at 11:59pm
4 | 


--------------------------------------------------------------------------------
/HW4/Stat115_Homework4.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: 'Stat 115 2020: Homework 4'
  3 | author: '(Your name)'
  4 | date: "Due: March 29, 2020 at 11:59pm"
  5 | output: html_document
  6 | ---
  7 | 
  8 | Androgen receptor (AR) is a transcription factor frequently over-activated in prostate cancer. To study AR regulation in prostate cancer, scientists conducted AR ChIP-seq in prostate tumors and normal prostate tissues. Since the difference between individual patients could be quite big, this study actually included many more tumor and normal samples. However, for the purpose of this HW, we will only use the ChIP-seq data from 1 prostate tumor sample (tumor) and 1 normal prostate tissue sample (normal).
  9 | 
 10 | Hint: It helps to read the MACS README and Nature Protocol paper:
 11 | 
 12 | https://pypi.python.org/pypi/MACS2/2.0.10.09132012
 13 | 
 14 | https://search-proquest-com.ezp-prod1.hul.harvard.edu/docview/1036979599/fulltextPDF/7A4604F292854FFAPQ/1?accountid=11311 
 15 | 
 16 | # Part I. Call AR ChIP-seq peaks
 17 | 
 18 | ## 1. For GRADUATE students: 
 19 | 
 20 | Usually we use BWA to map the reads to the genome for a ChIP-seq experiment. We will give you one example ChIP-seq single-end sequenced .fastq file with only 1M reads. Run BWA on this file to map it to the Hg38 human genome assembly. Report the commands, log files, and a snapshot / screenshot of the output to demonstrate your alignment procedure. What proportion of the reads are successfully mapped (to find at least one location) and what proportion are uniquely mapped (to find a single location) in the human genome in this test sample? We will save you some time and directly give you the BWA mapped BAM files for all the 4 samples.
 21 | 
 22 | Hint: 
 23 | 1). Target sample fastq file is stored as /n/stat115/2020/HW4/tumor_1M.fastq on the Odyssey
 24 | 2). The index file is stored as /n/stat115/2020/HW1/bwa_hg38_index/hg38.fasta on the Odyssey
 25 | 
 26 | ```{r, engine='bash', eval=FALSE}
 27 | # your bash code here
 28 | ```
 29 | 
 30 | ## 2. For GRADUATE students:
 31 | 
 32 | In ChIP-Seq experiments, when sequencing library preparation involves a PCR amplification step, it is common to observe multiple reads where identical nucleotide sequences are disproportionately represented in the final results. This is especially a problem in tissue ChIP-seq experiments (as compared to cell lines) when input cell numbers are low. Removing these duplicated reads can improve the peak calling accuracy. Thus, it may be necessary to perform a duplicate read removal step, which flags identical reads and subsequently removes them from the dataset. Run this step (`macs2 filterdup`) on your test sample (1M reads). What % of reads are redundant? Note that when doing peak calling, MACS filters duplicated reads by default.
 33 | 
 34 | Hint:
 35 | The test samples are stored as /n/stat115/2020/HW4/tumor.bam and /n/stat115/2020/HW4/normal.bam on the Odyssey.
 36 | 
 37 | ```{r, engine='bash', eval=FALSE}
 38 | # your bash code here
 39 | ```
 40 | 
 41 | ## 3. For both:
 42 | 
 43 | For many ChIP-seq experiments, usually chromatin input without enriching for the factor of interest is generated as control. However, in this experiment, we only have ChIP (of both tumor and normal) and no control samples. Without control, MACS2 will use the signals around the peaks to infer the chromatin background and estimate the ChIP enrichment over background. In ChIP-seq, + strand reads and – strand reads are distributed to the left and right of the binding site, and the distance between the + strand reads and – strand reads can be used to estimate the fragment length from sonication (note: with PE seq, insert size could be directly estimated). What is the estimated fragment size in each? Use MACS2 to call peaks from tumor1 and normal1 separately. How many peaks do you get from each condition with FDR < 0.05 and fold change > 5? 
 44 | 
 45 | ```{r, engine='bash', eval=FALSE}
 46 | # your bash code here
 47 | ```
 48 | 
 49 | ## 4. For both:
 50 | 
 51 | Now we want to see whether AR has differential binding sites between prostate tumors and normal prostates. MACS2 does have a function to call differential peaks between conditions, but requires both conditions to have input control. Since we don’t have input controls for these AR ChIP-seq, we will just run the AR tumor ChIP-seq over the AR normal ChIP-seq (pretend the latter to be input control) to find differential peaks. How many peaks do you get with FDR < 0.01 and fold change > 6?
 52 | 
 53 | ```{r, engine='bash', eval=FALSE}
 54 | # your bash code here
 55 | ```
 56 | 
 57 | 
 58 | 
 59 | # Part II. Evaluate AR ChIP-seq data quality 
 60 | 
 61 | ## 5. For both:
 62 | 
 63 | Cistrome Data Browser (http://cistrome.org/db/) has collected and pre-processed most of the publicly available published ChIP-seq data. Play with Cistrome DB. Biological sources indicate whether the ChIP-seq is generated from a cell line (e.g. VCaP, LNCaP, PC3, C4-2) or a tissue (Prostate). Are there over 10 AR ChIP-seq datasets available in human prostate tissues?
 64 | 
 65 | ## 6. For both:
 66 | 
 67 | Doing transcription factor ChIP-seq in tissues could be a tricky experiment, so sometimes even published studies have very bad data. Look at a few AR ChIP-seq samples in the prostate tissue on Cistrome and inspect their QC reports. Can you comment on which QC measures tell you whether a ChIP-seq is of good or bad quality? Include a screenshot of a good AR ChIP-seq vs a bad AR ChIP-seq.
 68 | 
 69 | ## 7. For GRADUATE students:
 70 | 
 71 | The antibody is one important factor influencing the quality of a ChIP-seq experiment. Click on the GEO (GSM) ID of some good quality vs bad quality ChIP-seq data, and see where they got their AR antibodies. If you plan to do an AR ChIP-seq experiment, which company and catalog # would you use to order the AR antibody?
 72 | 
 73 | 
 74 | 
 75 | # Part III. Find AR ChIP-seq motifs
 76 | 
 77 | ## 8. For GRADUATE students:
 78 | 
 79 | We want to see in prostate tumors, which other transcription factors (TF) might be collaborating with AR. Try any of the following motif finding tools to find TF motifs enriched in the differential AR peaks you identified above. Did you find the known AR motif, and motifs of other factors that might interact with AR in prostate cancer in gene regulation? Describe the tool you used, what you did, and what you found. Note that finding the correct AR motif is usually an important criterion for AR ChIP-seq QC.
 80 | 
 81 | Cistrome: http://cistrome.org/ap/root (Register a free account).
 82 | Weeder: http://159.149.160.88/pscan_chip_dev/ 
 83 | HOMER: http://homer.ucsd.edu/homer/motif/ 
 84 | MEME: http://meme-suite.org/tools/meme-chip 
 85 | 
 86 | ## 9. For both: 
 87 | 
 88 | Look at the AR binding distribution in Cistrome DB from a few good AR ChIP-seq data in prostate. Does AR bind mostly in the gene promoters, exons, introns, or intergenic regions? Also, look at the QC motifs to see what motifs are enriched in the ChIP-seq peaks. Do you see similar motifs here as those you found in your motif analyses? 
 89 | 
 90 | 
 91 | 
 92 | # Part IV. Identify AR-interacting transcription factors
 93 | 
 94 | ## 10. For GRADUATE students:
 95 | 
 96 | Sometimes members of the same transcription factor family (e.g. GATA1, 2, 3, 4, 5, 6) have similar binding motifs, similar binding sites (when they are expressed, although they might be expressed in very different tissues), and related functions. Therefore, to confirm that we have found the correct TFs interacting with AR in prostate tumors, in addition to looking for motifs enriched in the AR ChIP-seq, we also want to see whether the TFs are highly expressed in prostate tumors. For this, we will use the Exploration Component on TIMER (http://timer.cistrome.org/). First, try the “Gene DE” module to look at differential expression of genes in tumors. Check the top motifs you found before, and see which member of the TF family that recognizes the motif is highly expressed in prostate tissues or tumors. Another way is to see whether the TF family member and AR have correlated expression patterns in prostate tumors. Go to the “Gene Corr” tab, select prostate cancer (PRAD), enter AR as your gene of interest, enter the genes (you can enter multiple genes here) that are potential AR collaborators based on the motif, correct the correlation by tumor purity, and see whether the candidate TF is correlated with AR in prostate tumors. Based on the motif and expression evidence, which factor in each motif family is the most likely collaborator of AR in prostate cancer?
 97 | 
 98 | Note: When we conduct RNA-seq on prostate tumors, each tumor might contain cancer cells, normal prostate epithelial cells, stromal fibroblasts, and immune cells. As a result, genes that are highly expressed in cancer cells (including AR) could be correlated across different tumors simply due to tumor purity bias. When looking for genes correlated with AR specifically in the prostate cancer cells, we should therefore correct for this tumor purity bias.
 99 | 
100 | ## 11. For both:
101 | 
102 | Besides looking for motif enrichment, another way to find TFs that might interact with AR is to see whether there are other TF ChIP-seq data which have significant overlap with AR ChIP-seq. Take the differential AR ChIP-seq peaks (in .bed format) that are significantly higher in tumor than normal, and run this on the Cistrome Toolkit (http://dbtoolkit.cistrome.org/). The third function in Cistrome Toolkit looks through tens of thousands of published ChIP-seq data to see whether any have significant overlap with your peak list. You should see AR enriched in the results (since your input is a list of AR ChIP-seq peaks after all). What other factors did you see enriched? Do they agree with your motif analyses before? 
103 | 
104 | 
105 | 
106 | # Part V. Find AR direct target genes and pathways
107 | 
108 | ## 12. For GRADUATE students:
109 | 
110 | Now we try to see what target genes these AR binding sites regulate. Among the differentially expressed genes in prostate cancer, only a subset might be directly regulated by AR binding. One simple way of getting the AR target genes is to look at which genes have AR binding in their promoters. Write a python program that takes two input files: 1) the AR differential ChIP-seq peaks in tumor over normal; 2) refGene annotation. The program writes an output file containing the genes that have an AR ChIP-seq peak (in this case, a stronger peak in tumor) within +/- 3KB of the transcription start site (TSS) of the gene. How many putative AR target genes in prostate cancer do you get using this approach?
111 | 
112 | Note: From UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/), download the human RefSeq annotation table (find the file refGene.txt.gz for Hg38). To understand the columns in this file, check the query annotation at http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.sql. 
113 | 
114 | Hint: TSS is different for genes on the positive or negative strand, i.e. TSS is “txStart” for genes on the positive strand, “txEnd” for genes on the negative strand. When testing your python code, try a smaller number of gene annotations or a smaller number of peaks to check your results before moving forward.
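
The strand rule in the hint can be written down as a small Python sketch. It works on in-memory tuples and leaves file parsing out. The function names, the tuple layouts, and the toy genes and peaks are hypothetical, and coordinate conventions (0-based vs 1-based) are ignored for simplicity.

```python
def tss(tx_start, tx_end, strand):
    """Transcription start site: txStart on '+', txEnd on '-'."""
    return tx_start if strand == "+" else tx_end


def peak_near_tss(peak_start, peak_end, tss_pos, window=3000):
    """True if the peak overlaps the interval [tss - window, tss + window]."""
    return peak_start <= tss_pos + window and peak_end >= tss_pos - window


def target_genes(genes, peaks, window=3000):
    """genes: (name, chrom, strand, tx_start, tx_end); peaks: (chrom, start, end)."""
    hits = set()
    for name, chrom, strand, tx_start, tx_end in genes:
        pos = tss(tx_start, tx_end, strand)
        for p_chrom, p_start, p_end in peaks:
            if p_chrom == chrom and peak_near_tss(p_start, p_end, pos, window):
                hits.add(name)
                break
    return sorted(hits)


# Toy data: G1 and G2 share coordinates but sit on opposite strands.
toy_genes = [("G1", "chr1", "+", 10000, 20000),
             ("G2", "chr1", "-", 10000, 20000),
             ("G3", "chr2", "+", 10000, 20000)]
toy_peaks = [("chr1", 9000, 9500), ("chr2", 50000, 50500)]
toy_targets = target_genes(toy_genes, toy_peaks)
```

The nested loop is fine for a small test set, which is what the hint recommends starting with. For the full annotation, sorting peaks by chromosome and position would avoid scanning every peak for every gene.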
115 | 
116 | ## 13. For GRADUATE students:
117 | 
118 | Now overlap the putative AR target genes you get from above with the up-regulated genes in prostate cancer (`up_regulated_genes_in_prostate_cancer.txt`). Try to run DAVID on 1) the AR target genes from binding alone and 2) the AR target genes obtained by overlapping AR binding with differential expression. Are there enriched GO terms or pathways?
119 | 
120 | ## 14. For both:
121 | 
122 | Another way of getting the AR target genes is to consider the number of AR binding sites within 100KB of the TSS, but weight each binding site by an exponential decay of its distance to the gene TSS (i.e. peaks closer to the TSS have higher weights). For this, we have calculated a regulatory potential score for each RefSeq gene (`AR_peaks_regulatory_potential.txt`). Select the top 1500 genes with the highest regulatory potential score, try to run DAVID both with and without differential expression, and see the enriched GO terms.
123 | 
124 | Note: Basically this regulatory potential approach assumes that there are stronger AR targets (e.g. those genes with many AR binding sites within 100KB and have stronger differential expression) and weaker AR targets, instead of a binary Yes / No AR targets. 
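
To illustrate the idea of distance-weighted binding, the Python sketch below sums an exponentially decaying weight over all peaks within 100KB of a TSS. The exact decay function behind the provided score file is not stated in this assignment, so the form `exp(-distance / decay_bp)` and the 10 kb decay constant are assumptions made purely for illustration. The function name is hypothetical.

```python
import math


def regulatory_potential(tss_pos, peak_centers, decay_bp=10_000, max_dist=100_000):
    """Sum of distance-weighted peak contributions around one TSS.

    decay_bp is a hypothetical decay constant; peaks farther than
    max_dist from the TSS contribute nothing.
    """
    score = 0.0
    for center in peak_centers:
        dist = abs(center - tss_pos)
        if dist <= max_dist:
            score += math.exp(-dist / decay_bp)
    return score


# One peak on the TSS, one 10 kb away, one 200 kb away (ignored).
toy_score = regulatory_potential(100_000, [100_000, 110_000, 300_000])
```

A gene with several nearby peaks therefore gets a higher score than a gene with a single distant peak, which is what makes the score a graded measure instead of a yes / no call.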
125 | 
126 | ## 15. For GRADUATE students:
127 | 
128 | Comment on the AR targets you get from promoter binding (your code) and distance weighted binding. Which one gives you better function / pathway enrichment? Does considering differential expression help?
129 | 
130 | 
131 | 
132 | # Part VI. ATAC-seq
133 | 
134 | The molecular mechanism of a type of T cell leukemia is poorly understood. Since it is unclear which transcription factors (TF) are involved, scientists can’t do TF ChIP-seq. Instead, ATAC-seq was performed on the T cells from both the normal donors and the T cell leukemia patients on many individuals. For this HW, we will only select 3 normal (norm1, norm2, norm3) and 3 leukemia (leuk1, leuk2, leuk3) samples, and give you the read mapping BAM files (to Hg38). This part of the HW will show you how epigenetic profiling can help identify key transcription factors and the regulatory mechanisms of biological processes and diseases. 
135 | 
136 | Unlike ChIP-seq which often uses chromatin input as controls, ATAC-seq has no control samples. The best way to call differential ATAC-seq peaks between the tumor and normal is to obtain the union of tumor and normal ATAC-seq peaks, extract the read counts from all the 6 samples in the union peaks, then run DESeq2 on them to find differential peaks. SAMTools (http://samtools.sourceforge.net/) and BEDTools (https://bedtools.readthedocs.io/en/latest/) are extremely useful tools to manipulate SAM/BAM and BED files. Let’s try them here.
137 | 
138 | 
139 | ## 16. For both:
140 | 
141 | One way of getting the union peaks is to run MACS on each of the samples separately, then use BEDTools to merge the peaks together. E.g. if we use MACS to run peak calling on norm1 (norm1.bed) and leuk1 (leuk1.bed), can you merge the two sets of peaks into one merge.bed file using BEDTools? How many peaks do you get? (Hint: run MACS2 with an FDR cutoff of 0.01 on each sample first.)
142 | 
143 | Hint: 
144 | All the bam files are stored under /n/stat115/2020/HW4/Part_VI.
145 | Please refer to https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85853 to verify whether the bam files contain data collected from a normal donor or a leukemic donor.
146 | 
147 | ```{r, engine='bash', eval=FALSE}
148 | # your bash code here
149 | ```
150 | 
151 | 
152 | ## 17. For GRADUATE students:
153 | 
154 | Another way of calling the union peaks is to concatenate all the 6 BAM files together, then run MACS. We have done this already (union.bed). Use BEDTools to calculate the Jaccard index between the union.bed and the merge.bed you got in Q16. The Jaccard index between set A and set B is defined as $|A \cap B| / |A \cup B|$.
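
For intuition, the Jaccard index of two peak sets can be computed at the base-pair level, as in the Python sketch below. It handles intervals on a single chromosome only, and the function names and toy intervals are hypothetical. It is meant to illustrate the definition, not to replace the BEDTools command the question asks for.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals; ends are exclusive."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bp(intervals):
    """Number of base pairs covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def jaccard(a, b):
    """Base pairs in the intersection divided by base pairs in the union."""
    union = covered_bp(list(a) + list(b))
    intersection = covered_bp(a) + covered_bp(b) - union
    return intersection / union if union else 0.0


# Toy peak sets on one chromosome.
set_a = [(0, 100), (200, 300)]
set_b = [(50, 150), (250, 400)]
toy_jaccard = jaccard(set_a, set_b)
```

Here the two sets share 100 bp and together cover 350 bp, so the index is 100 / 350.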
155 | 
156 | ```{r, engine='bash', eval=FALSE}
157 | # your bash code here
158 | ```
159 | 
160 | ## 18. For both:
161 | 
162 | Extract the reads from the six BAM files in the union.bed peaks. Either the BEDTools multicov function or SAMTools bedcov function can achieve this, and generate a read count matrix on the peaks in the six files. Draw a PCA plot of the resulting matrix.
163 | 
164 | ```{r, engine='bash', eval=FALSE}
165 | # your bash code here
166 | ```
167 | 
168 | ## 19. For both:
169 | 
170 | Run DESeq2 on the six samples to identify differential ATAC-seq peaks between the 3 leukemia and 3 normal samples. How many peaks are leukemia specific or normal specific at FDR < 0.05? 
171 | 
172 | ```{r}
173 | # your code here
174 | ```
175 | 
176 | ## 20. For both:
177 | 
178 | Take the leukemia-specific ATAC-seq peaks, and run them on Cistrome Toolkit to see what public ChIP-seq have significant overlap with them. What transcription factors might be important in regulating this type of leukemic T cells?
179 | 
180 | ## 21. For GRADUATE students:
181 | 
182 | In Q10, we mentioned that sometimes members of the same transcription factor family have similar binding motifs, similar binding sites (when they are expressed, although they might be expressed in very different tissues), and related functions. Suppose we don’t have RNA-seq of these samples to calculate the expression levels of the TFs. However, we can use regulatory potential to assign the ATAC-seq peaks to genes to infer the expression level of a gene (i.e. a gene with many ATAC-seq peaks near its TSS is often expressed at a higher level), and see whether the inferred TF might have higher expression in leukemia than normal. Could you describe (not necessarily do it) how to refine the hypothesis on the specific TFs that might regulate this type of leukemic T cells?
183 | 
184 | # Rules for submitting the homework:
185 | 
186 | Please submit your solution directly on the canvas website. Please
187 | provide both your code in this Rmd document and an html file for your
188 | final write-up. Please pay attention to the clarity and cleanness of
189 | your homework.
190 | 
191 | The teaching fellows will grade your homework and give the grades with
192 | feedback through canvas within one week after the due date. Some of the
193 | questions might not have a unique or optimal solution. TFs will grade
194 | those according to your creativity and effort on exploration, especially
195 | in the graduate-level questions.
196 | 
197 | 


--------------------------------------------------------------------------------
/HW5/code/STAT115_HW5_2020.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "STAT 115 Homework 5"
  3 | author: "(your name)"
  4 | date: "Due: Sunday 4/12/2020 by 11:59 pm"
  5 | output: html_document
  6 | ---
  7 | 
  8 | # Part I. Hidden Markov Model and TAD boundaries
  9 | 
 10 | Topologically associating domains (TADs) define genomic intervals, where sequences within a TAD physically interact more frequently with each other than with sequences outside the TAD. TADs are often defined by HiC (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3149993/), an experimental technique designed to study the three-dimensional architecture of genomes. HiC generates PE sequenced data, where the two mate pairs indicate two genomic regions that might be far apart in the genome, but physically interact with each other. If we look across the genome in bins (40kb in the early paper, but now can go down to 5-10kb with deeper sequencing), we could find reads that are mapped there and check whether their interacting mate pairs are mapped upstream or downstream. In each bin, we can calculate a directional index (DI) to quantify the degree of upstream or downstream bias of a given bin (for more details, see the supplement, `Supplement_10.1038_nature11082.pdf`). For this HW, we ask you to implement a hidden Markov model (Viterbi) to find regions with upstream bias (DI < 0) and those with downstream bias (DI > 0), even though the DI in individual bins might have some noise. This way, TAD boundaries could be discovered as clusters of bins from negative DIs to positive DIs (see Supplementary Figure 12b).
 11 | 
 12 | For simplicity, we will only have two hidden states (upstream, downstream), and use the following HMM parameters (these do not necessarily capture the real data distribution, but are just meant to help your implementation):
 13 | 
 14 | ```
 15 | Initial probability: upstream = 0.5, downstream = 0.5
 16 | Transition probability: Pb(up to up) = Pb(dn to dn) = 0.9, Pb(up to dn) = Pb(dn to up) = 0.1
 17 | 
 18 | Emission probabilities: 
 19 | P{<-1200, [-1200,-800), [-800,-500), [-500,0), [0,500), [500,800), [800,1200), >= 1200 | upstream} = (0.01, 0.01, 0.02, 0.04, 0.65, 0.15, 0.08, 0.04)
 20 | P{<-1200, [-1200,-800), [-800,-500), [-500,0), [0,500), [500,800), [800,1200), >= 1200 | downstream} = (0.04, 0.08, 0.15, 0.65, 0.04, 0.02, 0.01, 0.01)
 21 | 
 22 | ```
 23 | 
 24 | #### I.1 Given the DI file (`HW5_ESC.Dixon_2015.DI.chr21.txt`), implement and utilize the Viterbi algorithm to predict the hidden states of the Hi-C data. Visualize your result with a graph utilizing the following: midpoint of genomic bin on the x axis; DI score per bin on the y-axis; color: hidden state of the HMM. 
 25 | 
 26 | Hint: Example HMM code can be found at:
 27 | http://www.adeveloperdiary.com/data-science/machine-learning/implement-viterbi-algorithm-in-hidden-markov-model-using-python-and-r/
 28 | 
 29 | 
 30 | ```{r}
 31 | 
 32 | data <- read.table("../data/HW5_ESC.Dixon_2015.DI.chr21.txt", col.names = c("chr", "start", "end", "DI"))
 33 | data$mid <- (data$end + data$start)/ 2
 34 | 
 35 | # Hint: make discrete states from the continuous directionality index
 36 | obs_states <- cut(data$DI, breaks = c(min(data$DI)-1,-1200, -800, -500, 0, 500, 800, 1200, max(data$DI)+1), right = FALSE) 
 37 | 
 38 | ```
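
For reference, the Viterbi recursion with the parameters listed above can be sketched in Python as follows. The probabilities are copied verbatim from the parameter table, and the sixth bin is assumed to be [500, 800), matching the breaks in the R hint. The constant and function names are hypothetical, log probabilities are used to avoid underflow on long sequences, and the toy DI values are made up. Note that with the table as given, the state labelled "upstream" places most of its emission mass on the non-negative DI bins.

```python
import math
from bisect import bisect_right

STATES = ("upstream", "downstream")
INIT = {"upstream": 0.5, "downstream": 0.5}
TRANS = {"upstream": {"upstream": 0.9, "downstream": 0.1},
         "downstream": {"upstream": 0.1, "downstream": 0.9}}
EMIT = {"upstream": [0.01, 0.01, 0.02, 0.04, 0.65, 0.15, 0.08, 0.04],
        "downstream": [0.04, 0.08, 0.15, 0.65, 0.04, 0.02, 0.01, 0.01]}
BREAKS = [-1200, -800, -500, 0, 500, 800, 1200]


def di_to_bin(di):
    """Map a DI value to one of 8 left-closed bins (index 0..7)."""
    return bisect_right(BREAKS, di)


def viterbi(obs):
    """Most likely hidden state path for a sequence of bin indices."""
    # Log score of the best path ending in each state at the first position.
    v = [{s: math.log(INIT[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = []
    for o in obs[1:]:
        scores, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            scores[s] = (v[-1][best] + math.log(TRANS[best][s])
                         + math.log(EMIT[s][o]))
            ptr[s] = best
        v.append(scores)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


toy_obs = [di_to_bin(d) for d in [600, 300, 100, -200, -600, -900]]
toy_path = viterbi(toy_obs)
```

On this toy sequence the decoded path stays in one state for the three positive DI values and switches once for the three negative ones, which is the smoothing behaviour the HMM is meant to provide.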
 39 | 
 40 | # Part II. Single cell ATAC-seq
 41 | 
 42 | For this exercise, we will be analyzing a single cell ATAC-Seq dataset of human peripheral blood mononuclear cells (PBMC) from the 10X Genomics platform. There are around 5,000 single cells that were sequenced on the Illumina NovaSeq. The raw data can be found at: https://support.10xgenomics.com/single-cell-atac/datasets/1.2.0/atac_pbmc_5k_v1. A processed Seurat scRNA-seq object used in the lab will be reused for the assignment and is available here: https://github.com/stat115/Lab_2020/blob/master/Lab09/scrna_source/output/PBMC5k_scRNAseq-for-integration.rds.
 43 | 
 44 | 
 45 | #### II.1 Read the 10X data and create a Seurat object that stores the reads-in-peaks count matrix. Filter out cells with fewer than 5000 counts (from the `passed_filters` variable). How many cells are retained and how many are excluded?
 46 | 
 47 | 
 48 | #### II.2 Quantify the gene activity for each cell using the `FeatureMatrix` function from Signac. Include your code below.
 49 | 
 50 | 
 51 | #### II.3 Process the gene activity matrix by normalizing and scaling it (`NormalizeData()` is provided by Seurat).
 52 | 
 53 | 
 54 | #### II.4 Process the peak matrix. a) Perform latent semantic indexing (LSI) to reduce the dimensionality of the scATAC-seq data. Reduce the dimension to 50. b) Run UMAP on the first 20 dimensions but excluding the first component. c) Cluster all the cells using `resolution = 0.6` and visualize these clusters on a UMAP embedding. Comment on why we recommended excluding the first LSI component.
 55 | 
 56 | 
 57 | #### II.5 Read in the pre-processed and clustered scRNA-seq dataset, which is provided as part of the homework and was generated for the lab exercise. Then identify anchors between the scATAC-seq dataset and the scRNA-seq dataset and use these anchors to transfer cell type labels from scRNA-seq to scATAC-seq cells. Visualize the predicted cell types on the UMAP plot of scATAC-seq data. 
 58 | 
 59 | 
 60 | #### II.6 [Graduate Students] Create a matrix heatmap comparing the Seurat cluster IDs from the scATAC-seq data with the predicted cell types from scRNA-seq. Describe which clusters appear to map 1 to 1 between the modalities and which clusters appear split. (Hint: use the `pbmc@meta.data` data frame and `dplyr::group_by`)
 61 | 
 62 | 
 63 | #### II.7 [Graduate Students] Using the transferred cell state annotations, find the differential peaks between the two clusters of B-cells (activated and memory). Visualize two of the top accessibility peaks that are different. Are the accessibility peaks visualized restricted to a particular celltype or present in other PBMC celltypes? 
 64 | 
 65 | 
 66 | #### II.8 [Graduate Students] Perform a motif analysis to identify motifs that are over-represented in the differential peaks between the activated and memory B-cells. Visualize the top two motifs that are differential between the B-cells. 
 67 | 
 68 | 
 69 | # Part III: GWAS Followup
 70 | 
 71 | The NHGRI-EBI GWAS Catalog is a curated dataset of trait-associated genetic variants in humans. While it provides associations between single-nucleotide polymorphisms (SNPs) and traits (e.g., cancer), the genetic variants in the GWAS Catalog are not necessarily causative or functional for a trait, since SNPs can be highly correlated with one another, as measured by linkage disequilibrium (LD). To learn the potential functional effect of a certain SNP, especially a non-coding variant, we can use RegulomeDB.
 72 | 
 73 | You will explore the following online resources: the NHGRI-EBI GWAS Catalog (https://www.ebi.ac.uk/gwas/), dbSNP (https://www.ncbi.nlm.nih.gov/snp/), LDLink (https://ldlink.nci.nih.gov/), and RegulomeDB (the beta version http://regulomedb.org or the more stable older version http://legacy.regulomedb.org/).
 74 | 
 75 | #### III.1 Explore whether there are genetic variants within the gene BRCA2 which are associated with any traits. What traits are associated with the BRCA2 variants? Which SNP has the smallest p-value related to breast cancer? What is the risk allele?
 76 | 
 77 | 
 78 | #### III.2 For the BRCA2 SNP with the most significant association with breast cancer, what consequence does the risk allele have on the BRCA2 protein sequence? Based on 1000 Genomes in LDLink, what is the allele frequency of the risk allele among the 5 ethnicities? In the population with the highest risk in the resource, what is the expected number of people with a heterozygous genotype at this SNP, assuming Hardy-Weinberg equilibrium?
 79 | 
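As a minimal sketch of the heterozygote calculation: under Hardy-Weinberg equilibrium, a risk allele with frequency p is expected to be heterozygous in a fraction 2p(1 - p) of individuals. The function name `expected_heterozygotes` and the numbers below are hypothetical; substitute the allele frequency and population size you read from LDLink.

```r
# Expected number of heterozygous carriers under Hardy-Weinberg equilibrium.
# p: risk allele frequency; n: number of individuals in the population.
expected_heterozygotes <- function(p, n) {
  stopifnot(p >= 0, p <= 1, n >= 0)
  2 * p * (1 - p) * n
}

# Hypothetical values for illustration only (not taken from LDLink).
p_risk <- 0.25
n_people <- 500
expected_heterozygotes(p_risk, n_people)  # 2 * 0.25 * 0.75 * 500 = 187.5
```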
 80 | 
 81 | #### III.3 Explore a certain SNP, rs4784227, that was reported to be associated with breast cancer. Is it an intergenic, exonic or intronic variant? What gene does it fall in?  
 82 | 
 83 | #### III.4 Explore the SNP rs4784227 in RegulomeDB. What functional category does the rank score (or Regulome DB Score) implicate? What factors does RegulomeDB take into consideration while scoring the potential function of SNPs?
 84 | 
 85 | 
 86 | #### III.5 Describe the evidence that implicates the regulatory potential of rs4784227. For example, list several transcription factors with binding peaks overlapping this SNP, and report the cell types with open chromatin regions overlapping this SNP.
 87 | 
 88 | 
 89 | #### III.6 [Graduate Students] Read the paper by Cowper-Sal·lari et al. (PMID 23001124) and summarize the potential mechanisms of the above SNP’s function in terms of affecting transcription factor-DNA interaction and regulating genes.
 90 | 
 91 | 
 92 | # Part IV: COVID19 Genomics
 93 | 
 94 | We are currently fighting an epidemic due to the SARS-CoV-2 virus. As more viruses from infected individuals are sequenced, the epidemiology of this pathogen is becoming better understood. Nextstrain (https://nextstrain.org/ncov) is an online resource that aggregates and tracks public sequencing data of the virus. Using screenshots to support your answers, address the following questions related to SARS-CoV-2:
 95 | 
 96 | #### IV.1 Determine the main clades of the virus as well as the main nucleotide and protein changes that define the clades. What are the genes associated with each mutation?
 97 | 
 98 | 
 99 | #### IV.2 Identify the main clade affecting four of the countries most severely affected by SARS-CoV-2: China, United States, Iran, and Italy.
100 | 
101 | 
102 | #### IV.3 The countries of Georgia, Democratic Republic of Congo, and Brazil have relatively few (but non-zero!) cases of SARS-CoV-2. Using the Nextstrain data, speculate on the countries from which the virus was most likely transmitted.
103 | 
104 | 
105 | #### IV.4 The spike protein (S) is currently the target of several therapeutic approaches and vaccines. Understanding which cases have mutated residues of this protein is of considerable importance. For the variant in this protein with the highest minor allele frequency, visualize the proportion of these cases of the virus world-wide. 
106 | 
107 | 
108 | #### IV.5 [Graduate Students] Preliminary reports from the New England Journal of Medicine (`nejmoa2002032.pdf`) suggest that men may be more susceptible than women. Using the metadata from Nextstrain, evaluate whether you can corroborate this finding. Further, determine whether the clade of the virus differentially affects men or women. Support your answers with statistical analyses.  
109 | 
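One possible way to set up the statistical tests is sketched below on a toy table. The column names `sex` and `clade` and all values are hypothetical stand-ins for the Nextstrain metadata, and the 50% baseline assumes equal numbers of men and women in the underlying population. Sequenced cases are not a random sample of infections, so treat any result with caution.

```r
# Toy stand-in for Nextstrain metadata; column names and values are hypothetical.
meta <- data.frame(
  sex   = c("Male", "Male", "Female", "Male", "Female", "Male",
            "Female", "Male", "Male", "Female", "Male", "Female"),
  clade = c("A", "B", "A", "A", "B", "B", "A", "B", "A", "B", "A", "A")
)

# Drop records with missing or unknown sex before testing.
meta <- meta[meta$sex %in% c("Male", "Female"), ]

# (1) Is the proportion of male cases different from 50%?
sex_counts <- table(meta$sex)
n_male <- as.integer(sex_counts["Male"])
sex_test <- binom.test(n_male, sum(sex_counts), p = 0.5)

# (2) Is clade associated with sex? Fisher's exact test suits small counts;
# chisq.test() is the usual choice for larger tables.
clade_by_sex <- table(meta$clade, meta$sex)
clade_test <- fisher.test(clade_by_sex)

sex_test$p.value
clade_test$p.value
```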
110 | 
111 | #### IV.6 [Graduate Students] For each country in the data reported from NextStrain, determine the clade that is responsible for the most cases and the percent of cases (per country).
112 | 
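The per-country summary can be sketched with a contingency table, assuming the metadata has one row per sequenced case. The `country` and `clade` column names and the toy values are hypothetical.

```r
# Toy stand-in for Nextstrain metadata; column names and values are hypothetical.
meta <- data.frame(
  country = c("X", "X", "X", "Y", "Y", "Z", "Z", "Z", "Z"),
  clade   = c("A", "A", "B", "B", "B", "A", "C", "C", "C")
)

# Country-by-clade contingency table.
counts <- table(meta$country, meta$clade)

# For each country, pick the most frequent clade and its share of cases.
# which.max() returns the first maximum, so ties go to the clade listed first.
top_clade <- data.frame(
  country = rownames(counts),
  clade   = colnames(counts)[apply(counts, 1, which.max)],
  percent = unname(100 * apply(counts, 1, max) / rowSums(counts)),
  stringsAsFactors = FALSE
)
top_clade
```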
113 | 
114 | 


--------------------------------------------------------------------------------
/HW5/papers/PMID23001124.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW5/papers/PMID23001124.pdf


--------------------------------------------------------------------------------
/HW5/papers/Supplement_10.1038_nature11082.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW5/papers/Supplement_10.1038_nature11082.pdf


--------------------------------------------------------------------------------
/HW5/papers/nejmoa2002032.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat115/Homework_2020/fc9e4dcb86946485fa40308f5dcb5bfeff9cb29c/HW5/papers/nejmoa2002032.pdf


--------------------------------------------------------------------------------
/HW6/README.md:
--------------------------------------------------------------------------------
1 | ## Homework6
2 | Due: April 29 @11:59pm
3 | 


--------------------------------------------------------------------------------
/HW6/Stat115_Homework6.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: 'Stat 115 2020: Homework 6'
  3 | author: '(Your Name Here)'
  4 | date: "Due: Wed, April 29, 2020 at 11:59pm"
  5 | output: html_document
  6 | ---
  7 | 
  8 | ```{r}
  9 | #Load packages that you might use
 10 | 
 11 | #Run `devtools::install_github("mariodeng/FirebrowseR")` if not installed
 12 | library(FirebrowseR)
 13 | library(limma)
 14 | library(ggplot2)
 15 | library(scales)
 16 | library(survival)
 17 | library(magrittr)
 18 | library(data.table)
 19 | library(knitr)
 20 | library(glmnet, quietly = TRUE)
 21 | library(MAGeCKFlute)
 22 | library(dplyr)
 23 | library(tidyr)
 24 | library(biobroom)
 25 | library(survminer)
 26 | ```
 27 | 
 28 | 
 29 | # Part I. Data exploration on TCGA
 30 | 
 31 | The Cancer Genome Atlas (TCGA) is an NCI project to comprehensively profile > 10K tumors in 33 cancer types. In this homework, we are going to explore TCGA data analysis. 
 32 | 
 33 | ## 1. For both: 
 34 | 
 35 | Go to the TCGA GDC website (https://portal.gdc.cancer.gov/) and explore the GDC data portal. How many glioblastoma (GBM) cases in TCGA meet ALL of the following requirements?
 36 | 1. Male;
 37 | 2. Diagnosed at an age above 45;
 38 | 3. Still alive.
 39 | 
 40 | 
 41 | ## 2. For both:
 42 | 
 43 | TCGA GDC (https://portal.gdc.cancer.gov/) and Broad Firehose (http://firebrowse.org/) both provide processed TCGA data for downloading and downstream analysis. Download the clinical data for GBM. What is the average age at diagnosis of all GBM patients?
 44 | 
 45 | ```{r}
 46 | # your code here
 47 | 
 48 | ```
 49 | 
 50 | 
 51 | 
 52 | # Part II. Tumor Subtypes
 53 | 
 54 | You are given a number of TCGA glioblastoma (GBM) samples and 10 commercially available normal brains (it is unethical to take matched normal brain from GBM tumor patients), including their expression, DNA methylation, mutation profiles as well as patient survival. Please note that we only selected a subset of the samples to make this HW, which were simplified to give students a flavor of cancer genomics studies, so some findings from these data might not reflect the real biology of GBM. 
 55 | 
 56 | 
 57 | ## 1. For both:
 58 | 
 59 | GBM is one of the earliest cancer types to be processed by TCGA, and the expression profiling was initially done with Affymetrix microarrays. Also, with brain cancer, it is hard to get a sufficient number of normal samples. We provide the pre-processed expression matrix in (GBM_expr.txt), where samples are columns and genes are rows. Run K-means clustering (k=3) on the samples twice: once using all the genes and once using the 2000 most variable genes. Do tumor and normal samples separate into different clusters? Do the tumor samples consistently separate into 2 clusters, regardless of whether you use all the genes or the most variable genes?
 60 | 
 61 | ```{r}
 62 | # your code here
 63 | 
 64 | ```
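A minimal sketch of the two clustering runs follows, using a simulated matrix in place of GBM_expr.txt. With the real file, something like `as.matrix(read.delim("GBM_expr.txt", row.names = 1))` would be a starting point, assuming the first column holds gene names. The variable names are hypothetical.

```r
set.seed(115)  # k-means starts from random centers

# Simulated stand-in for GBM_expr.txt: 3000 genes (rows) x 12 samples (columns).
expr <- matrix(rnorm(3000 * 12), nrow = 3000,
               dimnames = list(paste0("gene", 1:3000), paste0("sample", 1:12)))

# kmeans() clusters rows, so transpose to cluster samples rather than genes.
km_all <- kmeans(t(expr), centers = 3, nstart = 25)

# Keep the 2000 genes with the highest variance across samples.
gene_var <- apply(expr, 1, var)
top_genes <- names(sort(gene_var, decreasing = TRUE))[1:2000]
km_top <- kmeans(t(expr[top_genes, ]), centers = 3, nstart = 25)

# Cross-tabulate the two assignments to see whether samples group consistently.
table(all_genes = km_all$cluster, top2000 = km_top$cluster)
```

Cluster numbers are arbitrary labels, so compare the two runs through the cross-table rather than by matching cluster IDs directly.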
 65 | 
 66 | ## 2. For both:
 67 | 
 68 | limma is a Bioconductor package that performs differential expression analysis on microarray and RNA-seq data and can remove batch effects (especially useful if your experimental design has complex batches). Using limma, how many genes are differentially expressed between the two GBM subtypes (with FDR < 0.05 and logFC > 1.5)?
 69 | 
 70 | ```{r}
 71 | # your code here
 72 | 
 73 | ```
 74 | 
 75 | ## 3. For GRADUATE students:
 76 | 
 77 | From the DNA methylation profiles (GBM_meth.txt), which genes are significantly differentially methylated between the two subtypes? Is DNA methylation associated with higher or lower expression of these genes? How many differentially expressed genes have an epigenetic (DNA methylation) cause (i.e. how many differentially expressed genes are also differentially methylated)?
 78 | 
 79 | ```{r}
 80 | # your code here
 81 | 
 82 | ```
 83 | 
 84 | ## 4. For both:
 85 | 
 86 | With the survival data of the GBM tumors (GBM_clin.txt), make a Kaplan-Meier Curve to compare the two subtypes of GBM patients. Is there a significant difference in patient outcome between the two subtypes? 
 87 | 
 88 | ```{r}
 89 | # your code here
 90 | 
 91 | ```
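The sketch below shows one way to build the survival time from the three columns of GBM_clin.txt and compare two groups with a log-rank test. The eight rows are copied from the file, but the `subtype` labels are hypothetical placeholders for your own cluster assignments.

```r
library(survival)

# A few rows copied from GBM_clin.txt; the `subtype` labels are hypothetical.
clin <- data.frame(
  vital.status          = c(1, 1, 0, 1, 0, 1, 0, 1),
  days.to.death         = c(1300, 748, NA, 1977, NA, 620, NA, 108),
  days.to.last.followup = c(1300, 748, 1757, 1977, 466, 123, 958, 108),
  subtype               = c("S1", "S2", "S1", "S1", "S2", "S2", "S1", "S2"),
  row.names = c("TCGA-02-0025", "TCGA-02-0026", "TCGA-02-0087", "TCGA-02-0104",
                "TCGA-02-2483", "TCGA-06-0124", "TCGA-06-2570", "TCGA-06-5410")
)

# Survival time: days to death for deceased patients, otherwise the last
# follow-up (a censored observation).
clin$time <- ifelse(clin$vital.status == 1,
                    clin$days.to.death, clin$days.to.last.followup)

km_fit <- survfit(Surv(time, vital.status) ~ subtype, data = clin)
logrank <- survdiff(Surv(time, vital.status) ~ subtype, data = clin)

# Log-rank p-value from the chi-square statistic (df = number of groups - 1).
p_value <- pchisq(logrank$chisq, df = length(logrank$n) - 1, lower.tail = FALSE)

plot(km_fit, col = c("blue", "red"), xlab = "Days", ylab = "Survival probability")
legend("topright", legend = names(km_fit$strata), col = c("blue", "red"), lty = 1)
p_value
```

The survminer package loaded at the top of this document offers a ggplot2-based alternative to the base `plot()` call.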
 92 | 
 93 | ## 5. For GRADUATE students:
 94 | 
 95 | Use the differential genes (say this is Y number of genes) between the two GBM subtypes as a gene signature to do a Cox regression of the tumor samples. Does it give significant predictive power of patient outcome?
 96 | 
 97 | ```{r}
 98 | # your code here
 99 | 
100 | ```
101 | 
102 | ## 6. For GRADUATE students:
103 | 
104 | Many studies use gene signatures to predict prognosis of patients. Take a look at this paper: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002240.
105 | It turns out that most published gene signatures are not significantly more associated with outcome than random predictors.
106 | 
107 | Write a script to randomly sample Y genes from this expression data as a gene signature and do Cox regression on the sampled signature to predict patient outcome. Automate the script so that the random sampling and Cox regression are repeated 100 times. How does your signature in Q5 compare to random signatures in predicting outcome?
108 | 
109 | ```{r}
110 | # your code here
111 | 
112 | ```
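A sketch of the random-signature comparison on simulated data is given below. Summarizing each signature as the mean expression of its genes is only one possible design choice (a multivariable Cox model is another). The helper `signature_pvalue`, the signature size `y`, and the "observed" signature are all hypothetical.

```r
library(survival)
set.seed(115)

# Simulated stand-ins: 500 genes x 60 tumors, with random survival times.
n_genes <- 500
n_samples <- 60
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
time <- rexp(n_samples, rate = 1 / 500)
status <- rbinom(n_samples, 1, 0.7)

# Fit a Cox model on a signature score (mean expression of the signature
# genes) and return the likelihood-ratio test p-value.
signature_pvalue <- function(genes) {
  score <- colMeans(expr[genes, , drop = FALSE])
  fit <- coxph(Surv(time, status) ~ score)
  unname(summary(fit)$logtest["pvalue"])
}

y <- 20  # hypothetical signature size (the "Y" from Q5)
random_p <- replicate(100, signature_pvalue(sample(rownames(expr), y)))

# Hypothetical "real" signature: here simply the first Y genes.
observed_p <- signature_pvalue(rownames(expr)[1:y])

# Fraction of random signatures that do at least as well as the observed one.
mean(random_p <= observed_p)
```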
113 | 
114 | # Part III. Tumor mutation analyses and precision medicine
115 | 
116 | ## 1. For both:
117 | 
118 | The MAF files contain the mutations of each tumor compared to the normal DNA in the patient's blood. Write a script to parse out the mutations present in each tumor sample and write out a table. The table should rank the genes by how many times each is mutated across the tumor samples provided. Submit the table with the top 20 genes.
119 | 
120 | ```{r}
121 | # your code here
122 | 
123 | ```
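One way to rank genes is sketched below, counting each gene at most once per tumor. Counting every mutation row instead is an equally defensible reading of the question. `Hugo_Symbol` and `Tumor_Sample_Barcode` are real MAF columns; the function name `rank_mutated_genes` and the toy barcodes are hypothetical.

```r
# Count, for each gene, the number of tumor samples carrying at least one
# mutation in it, and return the genes ranked by that count.
rank_mutated_genes <- function(maf, top_n = 20) {
  # One row per (gene, sample) pair so a gene is counted once per tumor.
  per_sample <- unique(maf[, c("Hugo_Symbol", "Tumor_Sample_Barcode")])
  counts <- sort(table(as.character(per_sample$Hugo_Symbol)), decreasing = TRUE)
  ranked <- data.frame(gene = names(counts), n_samples = as.integer(counts),
                       stringsAsFactors = FALSE)
  head(ranked, top_n)
}

# Toy MAF-like table; the gene symbols appear in the sample MAF file, but the
# barcodes and counts are made up.
maf <- data.frame(
  Hugo_Symbol = c("CHD8", "CHD8", "SPTA1", "CHD8", "FREM2", "SPTA1"),
  Tumor_Sample_Barcode = c("T1", "T1", "T1", "T2", "T2", "T3"),
  stringsAsFactors = FALSE
)
ranked <- rank_mutated_genes(maf)
ranked
```

With the real data, the per-sample files could be read with `read.delim(..., quote = "")` and combined with `rbind` before calling the function. You may also want to drop `Silent` entries in `Variant_Classification` first.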
124 | 
125 | ## 2. For both:
126 | 
127 | Existing clinical genetic testing laboratories use information about the frequency of a mutation in cohorts, like from the GBM cohort in TCGA, to assess a mutation’s clinical significance (guidelines: https://www.ncbi.nlm.nih.gov/pubmed/27993330). Of the top 20 genes in Q1, what gene has the mutation seen the most times (hint: count mutations with the exact same amino acid change as the same)? Do you think this mutation forms a genetic subtype of GBM? 
128 | 
129 | ```{r}
130 | # your code here
131 | 
132 | ```
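Following the hint, recurrence can be counted on the combination of gene and amino acid change. The sketch below uses the real MAF columns `Hugo_Symbol` and `Protein_Change`, but the gene names, protein changes, and barcodes are hypothetical toy values.

```r
# Toy MAF-like table; all values are hypothetical.
maf <- data.frame(
  Hugo_Symbol          = c("GENE1", "GENE1", "GENE1", "GENE2", "GENE2"),
  Protein_Change       = c("p.A10V", "p.A10V", "p.R20H", "p.S5S", "p.S5S"),
  Tumor_Sample_Barcode = c("T1", "T2", "T3", "T1", "T1"),
  stringsAsFactors = FALSE
)

# Count each (gene, protein change) at most once per tumor sample.
per_sample <- unique(maf[, c("Hugo_Symbol", "Protein_Change",
                             "Tumor_Sample_Barcode")])

# Treat mutations with the same gene and amino acid change as identical.
key <- paste(per_sample$Hugo_Symbol, per_sample$Protein_Change)
recurrent <- sort(table(key), decreasing = TRUE)
head(recurrent, 3)
```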
133 | 
134 | ## 3. For both:
135 | 
136 | cBioPortal has a comprehensive list of tumor profiling results for interactive visualization. Go to cBioPortal (http://www.cbioportal.org), and select either “Glioblastoma” under “CNS/Brain” (left) or select “TCGA PanCancer Atlas Studies” under “Quick Select” (middle). Input each gene in Q1 and click Submit. From the OncoPrint tab, you can see how often each gene is mutated in GBM or all TCGA cancer types. Based on this, which of the genes in Part3 Q1 is likely to be a cancer driver gene?
137 | 
138 | ## 4. For both:
139 | 
140 | From the Mutation tab on the cBioPortal result page, is this mutation a gain or loss of function mutation on the gene you identified from Part3 Q2?
141 | 
142 | ## 5. For both:
143 | 
144 | From cBioPortal, select Glioblastoma (TCGA provisional, which has the largest number of samples) and enter the driver mutation gene in Q2. From the Survival tab, do GBM patients with this mutation have better outcome in terms of progression free survival and overall survival? 
145 | 
146 | ## 6. For both:
147 | 
148 | You are working with an oncologist collaborator to decide the treatment option for a GBM patient. From exome-seq of the tumor, you identified the top mutation in Part3 Q2. To find out whether there are drugs that can target this mutation to treat the cancer, go to https://www.clinicaltrials.gov to find clinical trials that target the gene in Q2. How many trials are related to glioblastoma? How many of these are actively recruiting patients, so that this patient could potentially join?
149 | Hint: Search by the disease name and gene name. The file containing all the trials can be exported as a .tsv file.
150 | 
151 | ```{r}
152 | # your code here
153 | 
154 | ```
155 | 
156 | 
157 | # Part IV. CRISPR screens
158 | 
159 | We will learn to analyze CRISPR screen data from this paper: https://www.ncbi.nlm.nih.gov/pubmed/?term=26673326. To identify therapeutic targets for glioblastoma (GBM), the authors performed genome-wide CRISPR-Cas9 knockout (KO) screens in a patient-derived GBM stem-like cell line (GSCs0131).
160 | 
161 | MAGeCK tutorial:
162 | https://sourceforge.net/p/mageck/wiki/Home/
163 | https://sourceforge.net/projects/mageck/
164 | 
165 | The data for the CRISPR screen is stored at /n/stat115/2020/HW6/crispr_data. There are 4 gzipped fastq files (ending in fastq.gz) which store the data, and a library.csv library file for the sgRNAs.
166 | 
167 | ## 1. For both:
168 | 
169 | Use MAGeCK to do a basic QC of the CRISPR screen data (e.g. read mapping, ribosomal gene selection, replicate consistency, etc).  
170 | 
171 | ```{r}
172 | # your code here
173 | 
174 | ```
175 | 
176 | ## 2. For both:
177 | 
178 | Analyze CRISPR screen data with MAGeCK to identify positive and negative selection genes. How many genes are selected as positive or negative selection genes, respectively, and what are their respective enriched pathways? 
179 | 
180 | ```{r}
181 | # your code here
182 | 
183 | ```
184 | 
185 | ## 3. For GRADUATE students:
186 | 
187 | Genes negatively selected in this CRISPR screen could be potential drug targets. However, if they are always negatively selected in many cell types, targeting such genes might be too toxic to normal cells. Go to DepMap (DepMap.org), which has CRISPR / RNAi screens of over 500 human cell lines, and click “Tools” -> Data Explorer. Pick the top 3 negatively selected genes to explore. Select Gene Dependency from CRISPR (Avana) on the X axis and Omics from Expression on the Y axis, to see the relationship between the expression level of the gene and dependency (CRISPR screen selection) of the gene across ~500 cell lines. Are the top 3 genes good drug targets?
188 | 
189 | ```{r}
190 | # your code here
191 | 
192 | ```
193 | 
194 | ## 4. For GRADUATE students:
195 | 
196 | Let’s filter out pan essential genes (PanEssential.txt) from the negatively selected genes in Q2. Take the remaining top 10 genes, and check whether those genes have drugs or are druggable from this website: http://www.oasis-genomics.org/. Go to Analysis -> Pan Cancer Report, enter the top 10 genes and check the table for druggability (more druggable for higher number on Dr). Which of these genes are druggable? 
197 | 
198 | PanEssential.txt is stored at /n/stat115/2020/HW6/crispr_data.
199 | 
200 | ```{r}
201 | # your code here
202 | 
203 | ```
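The filtering step can be sketched as below. The `id`, `neg|rank`, and `neg|fdr` columns follow the MAGeCK gene summary output (read it with `check.names = FALSE` to keep the `|` in the names). The toy gene names, the FDR cutoff of 0.05, and the assumption that PanEssential.txt lists one gene symbol per line are hypothetical.

```r
# Toy stand-in for a MAGeCK gene summary table; all values are hypothetical.
gene_summary <- data.frame(
  id         = c("G1", "G2", "G3", "G4", "G5"),
  `neg|rank` = c(1, 2, 3, 4, 5),
  `neg|fdr`  = c(0.001, 0.002, 0.01, 0.2, 0.9),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Toy stand-in for the pan-essential gene list, e.g. from
# readLines("PanEssential.txt") if the file has one gene per line.
pan_essential <- c("G1", "G3")

# Negatively selected genes, ordered from strongest to weakest.
neg_hits <- gene_summary[gene_summary$`neg|fdr` < 0.05, ]
neg_hits <- neg_hits[order(neg_hits$`neg|rank`), ]

# Remove pan-essential genes and keep the top 10 that remain.
# setdiff() preserves the order of its first argument.
candidates <- head(setdiff(neg_hits$id, pan_essential), 10)
candidates
```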
204 | 
205 | # Part V. Cancer immunology and immunotherapy
206 | 
207 | Immune checkpoint inhibitors, which primarily activate CD8 T cells, have shown remarkable efficacy in melanoma (SKCM), but haven’t worked as well in GBM patients. Let’s explore the tumor immune microenvironment from TCGA data. Although the cancer patients in TCGA were not treated with immunotherapy, their response to other drugs and clinical outcome might be influenced by pre-treatment tumor immune microenvironment. 
208 | 
209 | ## 1. For both:
210 | 
211 | TIMER (http://timer.cistrome.org/) estimates the infiltration level of different immune cells in TCGA tumors using different immune deconvolution methods. CD8A and CD8B are two marker genes for CD8 T cells. On the Diff Exp tab, compare the expression level of either CD8A or CD8B between GBM and SKCM (Metastatic Melanoma). Based on this, which cancer type has more CD8 T cells?
212 | 
213 | ## 2. For both:
214 | 
215 | On the Gene tab, select both GBM and SKCM (Metastatic Melanoma), and include all 6 immune cell infiltrates. Check the following genes, PD1, PDL1, and CTLA4, which are the targets of immune checkpoint inhibitors, to see whether their expression level is associated with immune cell infiltration in the GBM and SKCM tumors. Their higher expression usually indicates that T cells are in a dysfunctional state, which immune checkpoint inhibitors aim to revive.
216 | 
217 | ## 3. For both:
218 | 
219 | On the Survival tab, select both GBM and SKCM, include all 6 immune cell infiltrates, and add tumor stage and patient age as the clinical variables to conduct survival analyses. Based on the Cox PH model, what factors are the most significantly associated with patient survival in each cancer type? Plot the Kaplan-Meier curve to evaluate how each immune cell infiltrate is associated with survival. Which cells are associated with patient survival in which cancer type?
220 | 
221 | ## 4. For GRADUATE students:
222 | 
223 | Based on the above observations, can you hypothesize why immune checkpoint inhibitors don’t work well for some GBM patients? 
224 | 
225 | # Rules for submitting the homework:
226 | 
227 | Please submit your solution directly on the canvas website. Please
228 | provide both your code in this Rmd document and an html file for your
229 | final write-up. Please pay attention to the clarity and cleanness of
230 | your homework.
231 | 
232 | The teaching fellows will grade your homework and give the grades with
233 | feedback through canvas within one week after the due date. Some of the
234 | questions might not have a unique or optimal solution. TFs will grade
235 | those according to your creativity and effort on exploration, especially
236 | in the graduate-level questions.
237 | 
238 | 


--------------------------------------------------------------------------------
/HW6/data/GBM_clin.txt:
--------------------------------------------------------------------------------
 1 | 	vital.status	days.to.death	days.to.last.followup
 2 | TCGA-02-0025	1	1300	1300
 3 | TCGA-02-0026	1	748	748
 4 | TCGA-02-0080	1	2729	2729
 5 | TCGA-02-0084	1	384	7
 6 | TCGA-02-0085	1	1561	1561
 7 | TCGA-02-0087	0	NA	1757
 8 | TCGA-02-0104	1	1977	1977
 9 | TCGA-02-0114	1	3041	3041
10 | TCGA-02-0116	1	1489	1489
11 | TCGA-02-0258	1	503	503
12 | TCGA-02-2483	0	NA	466
13 | TCGA-06-0124	1	620	123
14 | TCGA-06-0128	1	691	691
15 | TCGA-06-0129	1	1024	989
16 | TCGA-06-0146	1	611	611
17 | TCGA-06-0164	1	1731	1730
18 | TCGA-06-0194	1	142	142
19 | TCGA-06-0201	1	12	12
20 | TCGA-06-0210	1	225	151
21 | TCGA-06-0397	1	274	168
22 | TCGA-06-1805	0	NA	1031
23 | TCGA-06-2570	0	NA	958
24 | TCGA-06-5410	1	108	108
25 | TCGA-06-5412	1	138	138
26 | TCGA-06-5417	0	NA	155
27 | TCGA-06-6389	0	NA	237
28 | TCGA-08-0344	1	3524	3310
29 | TCGA-08-0346	1	256	132
30 | TCGA-08-0350	1	889	104
31 | TCGA-08-0373	1	134	134
32 | TCGA-08-0509	1	382	17
33 | TCGA-08-0510	1	130	107
34 | TCGA-12-0620	1	318	181
35 | TCGA-12-0772	1	1638	1615
36 | TCGA-12-0775	1	232	167
37 | TCGA-12-0818	1	2791	2791
38 | TCGA-12-0827	1	1179	1179
39 | TCGA-12-1090	1	231	190
40 | TCGA-14-0783	1	189	189
41 | TCGA-14-1456	0	NA	1246
42 | TCGA-14-1821	1	541	541
43 | TCGA-14-4157	0	NA	104
44 | TCGA-16-0849	0	NA	793
45 | TCGA-16-0850	1	498	496
46 | TCGA-16-1060	1	278	111
47 | TCGA-16-1460	0	NA	195
48 | TCGA-19-0962	1	20	13
49 | TCGA-19-1389	1	141	141
50 | TCGA-19-1790	1	154	154
51 | TCGA-19-2629	1	737	501
52 | TCGA-26-1442	0	NA	953
53 | TCGA-26-5133	0	NA	452
54 | TCGA-27-2521	0	NA	316
55 | TCGA-28-1756	0	NA	86
56 | TCGA-28-5209	0	NA	442
57 | TCGA-28-5218	1	157	128
58 | TCGA-32-4208	0	NA	643
59 | TCGA-32-4209	1	618	604
60 | TCGA-32-4213	0	NA	604
61 | TCGA-41-3393	1	135	135
62 | 


--------------------------------------------------------------------------------
/HW6/data/TCGA-06-5410.maf.txt:
--------------------------------------------------------------------------------
 1 | Hugo_Symbol	Entrez_Gene_Id	Center	NCBI_Build	Chromosome	Start_position	End_position	Strand	Variant_Classification	Variant_Type	Reference_Allele	Tumor_Seq_Allele1	Tumor_Seq_Allele2	dbSNP_RS	dbSNP_Val_Status	Tumor_Sample_Barcode	Matched_Norm_Sample_Barcode	Match_Norm_Seq_Allele1	Match_Norm_Seq_Allele2	Tumor_Validation_Allele1	Tumor_Validation_Allele2	Match_Norm_Validation_Allele1	Match_Norm_Validation_Allele2	Verification_Status	Validation_Status	Mutation_Status	Sequencing_Phase	Sequence_Source	Validation_Method	Score	BAM_file	Sequencer	Tumor_Sample_UUID	Matched_Norm_Sample_UUID	Genome_Change	Annotation_Transcript	Transcript_Strand	Transcript_Exon	Transcript_Position	cDNA_Change	Codon_Change	Protein_Change	Other_Transcripts	Refseq_mRNA_Id	Refseq_prot_Id	SwissProt_acc_Id	SwissProt_entry_Id	Description	UniProt_AApos	UniProt_Region	UniProt_Site	UniProt_Natural_Variations	UniProt_Experimental_Info	GO_Biological_Process	GO_Cellular_Component	GO_Molecular_Function	COSMIC_overlapping_mutations	COSMIC_fusion_genes	COSMIC_tissue_types_affected	COSMIC_total_alterations_in_gene	Tumorscape_Amplification_Peaks	Tumorscape_Deletion_Peaks	TCGAscape_Amplification_Peaks	TCGAscape_Deletion_Peaks	DrugBank	ref_context	gc_content	CCLE_ONCOMAP_overlapping_mutations	CCLE_ONCOMAP_total_mutations_in_gene	CGC_Mutation_Type	CGC_Translocation_Partner	CGC_Tumor_Types_Somatic	CGC_Tumor_Types_Germline	CGC_Other_Diseases	DNARepairGenes_Role	FamilialCancerDatabase_Syndromes	MUTSIG_Published_Results	OREGANNO_ID	OREGANNO_Values
 2 | SPTA1	6708	broad.mit.edu	37	1	158592861	158592861	+	Missense_Mutation	SNP	G	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr1:158592861G>A	uc001fst.1	-	43	6231	c.6032C>T	c.(6031-6033)GCC>GTC	p.A2011V		NM_003126	NP_003117	P02549	SPTA1_HUMAN	spectrin, alpha, erythrocytic 1	2011	Spectrin 19.				actin filament capping|actin filament organization|axon guidance|regulation of cell shape	cytosol|intrinsic to internal side of plasma membrane|spectrin|spectrin-associated cytoskeleton	actin filament binding|calcium ion binding|structural constituent of cytoskeleton			ovary(4)|skin(2)|upper_aerodigestive_tract(1)|breast(1)	8	all_hematologic(112;0.0378)					CAGCAGAGCGGCATAACGCTC	0.483												
 3 | C1orf112	55732	broad.mit.edu	37	1	169772375	169772375	+	Silent	SNP	C	T	T			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr1:169772375C>T	uc001ggp.2	+	6	547	c.237C>T	c.(235-237)TCC>TCT	p.S79S	C1orf112_uc001ggj.2_RNA|C1orf112_uc001ggo.2_Silent_p.S79S|C1orf112_uc001ggq.2_Silent_p.S79S|C1orf112_uc009wvt.2_5'UTR|C1orf112_uc010plu.1_Silent_p.S50S|C1orf112_uc009wvu.1_Silent_p.S50S|C1orf112_uc001ggr.2_5'UTR|C1orf112_uc010plv.1_Silent_p.S21S	NM_018186	NP_060656	Q9NSG2	CA112_HUMAN	hypothetical protein LOC55732	79											0	all_hematologic(923;0.0922)|Acute lymphoblastic leukemia(37;0.181)					CACAGGAATCCATCATTTTGG	0.363												
 4 | DLG5	9231	broad.mit.edu	37	10	79566617	79566617	+	Silent	SNP	C	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr10:79566617C>A	uc001jzk.2	-	26	4936	c.4866G>T	c.(4864-4866)GTG>GTT	p.V1622V	DLG5_uc001jzi.2_Silent_p.V377V|DLG5_uc001jzj.2_Silent_p.V1037V|DLG5_uc009xru.1_RNA	NM_004747	NP_004738	Q8TDM6	DLG5_HUMAN	discs large homolog 5	1622	SH3.				cell-cell adhesion|intracellular signal transduction|negative regulation of cell proliferation|regulation of apoptosis	cell junction|cytoplasm	beta-catenin binding|cytoskeletal protein binding|receptor signaling complex scaffold activity			ovary(5)|breast(3)	8	all_cancers(46;0.0316)|all_epithelial(25;0.00147)|Breast(12;0.0015)|Prostate(51;0.0146)		Epithelial(14;0.00105)|OV - Ovarian serous cystadenocarcinoma(4;0.00151)|all cancers(16;0.00446)			AGGTGTCATCCACGTAGAGGA	0.572												
 5 | OR8H2	390151	broad.mit.edu	37	11	55873242	55873242	+	Missense_Mutation	SNP	G	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr11:55873242G>A	uc010riy.1	+	1	724	c.724G>A	c.(724-726)GTC>ATC	p.V242I		NM_001005200	NP_001005200	Q8N162	OR8H2_HUMAN	olfactory receptor, family 8, subfamily H,	242	Helical; Name=6; (Potential).				sensory perception of smell	integral to membrane|plasma membrane	olfactory receptor activity			ovary(1)|skin(1)	2	Esophageal squamous(21;0.00693)					CTCTACTTGCGTCTCTCATCT	0.383										HNSCC(53;0.14)		
 6 | GLYATL2	219970	broad.mit.edu	37	11	58602091	58602091	+	Silent	SNP	G	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr11:58602091G>A	uc001nnd.3	-	6	827	c.696C>T	c.(694-696)TAC>TAT	p.Y232Y	GLYATL2_uc009ymq.2_Silent_p.Y232Y	NM_145016	NP_659453	Q8WU03	GLYL2_HUMAN	glycine-N-acyltransferase-like 2	232						mitochondrion	glycine N-acyltransferase activity			ovary(1)|skin(1)	2		Breast(21;0.0044)|all_epithelial(135;0.0216)			Glycine(DB00145)	CTTGGTGTCTGTATTTGGGGA	0.413												
 7 | CTTN	2017	broad.mit.edu	37	11	70279266	70279266	+	Silent	SNP	G	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr11:70279266G>A	uc001opv.3	+	16	1532	c.1326G>A	c.(1324-1326)CCG>CCA	p.P442P	CTTN_uc001opu.2_Silent_p.P405P|CTTN_uc001opw.3_Silent_p.P405P|CTTN_uc010rqm.1_Silent_p.P126P|CTTN_uc001opx.2_Silent_p.P126P	NM_005231	NP_005222	Q14247	SRC8_HUMAN	cortactin isoform a	442						cell cortex|cytoskeleton|lamellipodium|ruffle|soluble fraction	protein binding			ovary(1)	1			BRCA - Breast invasive adenocarcinoma(2;4.34e-41)|LUSC - Lung squamous cell carcinoma(11;1.51e-13)|STAD - Stomach adenocarcinoma(18;0.0513)	Lung(977;0.0234)|LUSC - Lung squamous cell carcinoma(976;0.133)		GGACGGAGCCGGAGCCCGTGT	0.652												
 8 | DYNC2H1	79659	broad.mit.edu	37	11	103014114	103014114	+	Nonsense_Mutation	SNP	C	T	T			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr11:103014114C>T	uc001pho.2	+	18	2836	c.2692C>T	c.(2692-2694)CGA>TGA	p.R898*	DYNC2H1_uc001phn.1_Nonsense_Mutation_p.R898*|DYNC2H1_uc009yxe.1_Intron	NM_001080463	NP_001073932	Q8NCM8	DYHC2_HUMAN	dynein, cytoplasmic 2, heavy chain 1	898	Stem (By similarity).				cell projection organization|Golgi organization|microtubule-based movement|multicellular organismal development	cilium axoneme|dynein complex|Golgi apparatus|microtubule|plasma membrane	ATP binding|ATPase activity|microtubule motor activity				0		Acute lymphoblastic leukemia(157;0.000966)|all_hematologic(158;0.00348)		BRCA - Breast invasive adenocarcinoma(274;0.000177)|Epithelial(105;0.0785)		AGAAGTAGAACGACTTCCAAG	0.363												
 9 | BCL2L14	79370	broad.mit.edu	37	12	12232401	12232401	+	Silent	SNP	C	T	T			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr12:12232401C>T	uc001rac.2	+	2	363	c.162C>T	c.(160-162)TCC>TCT	p.S54S	ETV6_uc001raa.1_Intron|BCL2L14_uc001raf.1_RNA|BCL2L14_uc001rad.2_Silent_p.S54S|BCL2L14_uc001rae.2_Silent_p.S54S	NM_138723	NP_620049	Q9BZR8	B2L14_HUMAN	BCL2-like 14 isoform 1	54					apoptosis|regulation of apoptosis	cytosol|endomembrane system|intracellular organelle|membrane	protein binding	p.S54S(1)		skin(1)	1		Prostate(47;0.0872)		BRCA - Breast invasive adenocarcinoma(232;0.154)		GAAGTTTGTCCCAGAGGGGCC	0.488												
10 | LIMA1	51474	broad.mit.edu	37	12	50575756	50575756	+	Missense_Mutation	SNP	C	T	T			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr12:50575756C>T	uc001rwj.3	-	10	1379	c.1205G>A	c.(1204-1206)CGT>CAT	p.R402H	LIMA1_uc001rwg.3_Missense_Mutation_p.R100H|LIMA1_uc001rwh.3_Missense_Mutation_p.R241H|LIMA1_uc001rwi.3_Missense_Mutation_p.R243H|LIMA1_uc001rwk.3_Missense_Mutation_p.R403H|LIMA1_uc010smr.1_RNA|LIMA1_uc010sms.1_RNA	NM_016357	NP_057441	Q9UHB6	LIMA1_HUMAN	LIM domain and actin binding 1 isoform b	402	LIM zinc-binding.				actin filament bundle assembly|negative regulation of actin filament depolymerization|ruffle organization	cytoplasm|focal adhesion|stress fiber	actin filament binding|actin monomer binding|zinc ion binding			ovary(1)	1						GGCCAAGAGACGCTCCATTGG	0.473												
11 | DGKA	1606	broad.mit.edu	37	12	56330335	56330335	+	Silent	SNP	G	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr12:56330335G>A	uc001sij.2	+	2	312	c.48G>A	c.(46-48)CTG>CTA	p.L16L	DGKA_uc009zoc.1_Silent_p.L16L|DGKA_uc001sih.1_5'UTR|DGKA_uc001sii.1_5'UTR|DGKA_uc009zod.1_Silent_p.L16L|DGKA_uc009zoe.1_Silent_p.L16L|DGKA_uc001sik.2_Silent_p.L16L|DGKA_uc001sil.2_Silent_p.L16L|DGKA_uc001sim.2_Silent_p.L16L|DGKA_uc001sin.2_Silent_p.L16L|DGKA_uc009zof.2_5'UTR|DGKA_uc001sio.2_5'UTR	NM_001345	NP_001336	P23743	DGKA_HUMAN	diacylglycerol kinase, alpha 80kDa	16					activation of protein kinase C activity by G-protein coupled receptor protein signaling pathway|intracellular signal transduction|platelet activation	plasma membrane	ATP binding|calcium ion binding|diacylglycerol kinase activity			ovary(3)|pancreas(1)	4					Vitamin E(DB00163)	TTGCCCAGCTGCAAAAATACA	0.527												
12 | FREM2	341640	broad.mit.edu	37	13	39266205	39266205	+	Missense_Mutation	SNP	T	G	G			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr13:39266205T>G	uc001uwv.2	+	1	5033	c.4724T>G	c.(4723-4725)GTG>GGG	p.V1575G		NM_207361	NP_997244	Q5SZK8	FREM2_HUMAN	FRAS1-related extracellular matrix protein 2	1575	Extracellular (Potential).|CSPG 11.				cell communication|homophilic cell adhesion|multicellular organismal development	integral to membrane|plasma membrane	calcium ion binding			ovary(7)|pancreas(1)|haematopoietic_and_lymphoid_tissue(1)|central_nervous_system(1)|skin(1)	11		Lung NSC(96;1.04e-07)|Prostate(109;0.00384)|Breast(139;0.00396)|Lung SC(185;0.0565)|Hepatocellular(188;0.114)		all cancers(112;3.32e-07)|Epithelial(112;1.66e-05)|OV - Ovarian serous cystadenocarcinoma(117;0.00154)|BRCA - Breast invasive adenocarcinoma(63;0.00631)|GBM - Glioblastoma multiforme(144;0.0312)		ATCACCCAGGTGCCTATTCAT	0.418												
13 | CHD8	57680	broad.mit.edu	37	14	21871325	21871325	+	Nonsense_Mutation	SNP	G	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr14:21871325G>A	uc001was.1	-	18	2822	c.2728C>T	c.(2728-2730)CAG>TAG	p.Q910*	CHD8_uc001war.1_Nonsense_Mutation_p.Q806*|CHD8_uc001wav.1_Nonsense_Mutation_p.Q352*	NM_020920	NP_065971	Q9HCK8	CHD8_HUMAN	chromodomain helicase DNA binding protein 8	1189	Helicase C-terminal.				ATP-dependent chromatin remodeling|canonical Wnt receptor signaling pathway|negative regulation of transcription, DNA-dependent|negative regulation of Wnt receptor signaling pathway|positive regulation of transcription from RNA polymerase II promoter|positive regulation of transcription from RNA polymerase III promoter|transcription, DNA-dependent	MLL1 complex	ATP binding|beta-catenin binding|DNA binding|DNA helicase activity|DNA-dependent ATPase activity|methylated histone residue binding|p53 binding			ovary(6)|upper_aerodigestive_tract(1)|large_intestine(1)|breast(1)|skin(1)	10	all_cancers(95;0.00121)		Epithelial(56;2.55e-06)|all cancers(55;1.73e-05)	GBM - Glioblastoma multiforme(265;0.00424)		ATGGCAGCCTGTCGAAGGTTG	0.478												
14 | LRFN5	145581	broad.mit.edu	37	14	42360496	42360496	+	Missense_Mutation	SNP	G	C	C			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr14:42360496G>C	uc001wvm.2	+	4	2627	c.1429G>C	c.(1429-1431)GCT>CCT	p.A477P	LRFN5_uc010ana.2_Intron	NM_152447	NP_689660	Q96NI6	LRFN5_HUMAN	leucine rich repeat and fibronectin type III	477	Extracellular (Potential).|Fibronectin type-III.					integral to membrane				ovary(5)|pancreas(2)|central_nervous_system(1)	8			LUAD - Lung adenocarcinoma(50;0.0223)|Lung(238;0.0728)	GBM - Glioblastoma multiforme(112;0.00847)		CAATAATCTGGCTGCTGGAAC	0.403										HNSCC(30;0.082)		
15 | IL32	9235	broad.mit.edu	37	16	3119304	3119305	+	Frame_Shift_Ins	INS	-	G	G	rs2981599		TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr16:3119304_3119305insG	uc002cto.2	+	6	864_865	c.653_654insG	c.(652-654)GACfs	p.D218fs	IL32_uc002ctk.2_Frame_Shift_Ins_p.D115fs|IL32_uc010uwp.1_Frame_Shift_Ins_p.D152fs|IL32_uc010btb.2_Frame_Shift_Ins_p.D162fs|IL32_uc002ctl.2_Frame_Shift_Ins_p.D172fs|IL32_uc002ctm.2_Frame_Shift_Ins_p.D172fs|IL32_uc002ctn.2_Frame_Shift_Ins_p.D172fs|IL32_uc002cts.3_Frame_Shift_Ins_p.D172fs|IL32_uc002ctp.2_Frame_Shift_Ins_p.D152fs|IL32_uc002ctq.2_Frame_Shift_Ins_p.D218fs|IL32_uc002ctr.2_Frame_Shift_Ins_p.D152fs|IL32_uc002ctt.2_Frame_Shift_Ins_p.D172fs|IL32_uc010uwr.1_Frame_Shift_Ins_p.D132fs|IL32_uc002ctu.2_Frame_Shift_Ins_p.D163fs	NM_004221	NP_004212	P24001	IL32_HUMAN	interleukin 32 isoform B	218					cell adhesion|defense response|immune response	extracellular space	cytokine activity			pancreas(1)	1						CCACGGGGGGACAAGGAGGAGC	0.574												
16 | ZNF263	10127	broad.mit.edu	37	16	3339555	3339555	+	Missense_Mutation	SNP	A	G	G			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr16:3339555A>G	uc002cuq.2	+	6	1381	c.1049A>G	c.(1048-1050)GAG>GGG	p.E350G	ZNF263_uc010uww.1_5'UTR|ZNF263_uc002cur.2_5'UTR	NM_005741	NP_005732	O14978	ZN263_HUMAN	zinc finger protein 263	350					viral reproduction	nucleus	DNA binding|sequence-specific DNA binding transcription factor activity|zinc ion binding			skin(3)|ovary(1)	4						CCTCCCCCAGAGGGTGGAATG	0.617												
17 | ADCY9	115	broad.mit.edu	37	16	4016471	4016471	+	Missense_Mutation	SNP	C	T	T			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr16:4016471C>T	uc002cvx.2	-	11	3906	c.3367G>A	c.(3367-3369)GCG>ACG	p.A1123T		NM_001116	NP_001107	O60503	ADCY9_HUMAN	adenylate cyclase 9	1123	Guanylate cyclase 2.|Cytoplasmic (Potential).				activation of adenylate cyclase activity by G-protein signaling pathway|activation of phospholipase C activity|activation of protein kinase A activity|cellular response to glucagon stimulus|energy reserve metabolic process|inhibition of adenylate cyclase activity by G-protein signaling pathway|nerve growth factor receptor signaling pathway|synaptic transmission|transmembrane transport|water transport	integral to plasma membrane	adenylate cyclase activity|ATP binding|metal ion binding			ovary(4)|large_intestine(1)|central_nervous_system(1)	6						TGGGCCTGCGCGGTGTTCAGC	0.602												
18 | RRN3P1	730092	broad.mit.edu	37	16	21817457	21817457	+	Silent	SNP	G	A	A	rs150520281	by1000genomes	TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr16:21817457G>A	uc010vbl.1	-	7	603	c.106C>T	c.(106-108)CTG>TTG	p.L36L	uc002diq.3_Intron	NR_003370				SubName: Full=Putative uncharacterized protein ENSP00000219758;												0						CTTACATCCAGCTTGAGTAGT	0.254												
19 | TERF2IP	54386	broad.mit.edu	37	16	75690204	75690206	+	In_Frame_Del	DEL	GAA	-	-	rs140846731		TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr16:75690204_75690206delGAA	uc002fet.1	+	3	992_994	c.895_897delGAA	c.(895-897)GAAdel	p.E304del		NM_018975	NP_061848	Q9NYB0	TE2IP_HUMAN	telomeric repeat binding factor 2, interacting	304	Asp/Glu-rich (acidic).				negative regulation of DNA recombination at telomere|negative regulation of telomere maintenance|positive regulation of I-kappaB kinase/NF-kappaB cascade|positive regulation of NF-kappaB transcription factor activity|protection from non-homologous end joining at telomere|protein localization to chromosome, telomeric region|regulation of double-strand break repair via homologous recombination|telomere maintenance via telomerase|transcription, DNA-dependent	cytoplasm|nuclear telomere cap complex|nucleoplasm	DNA binding|protein binding			central_nervous_system(1)	1						TGATgaggaggaagaagaagaag	0.369												
20 | NF1	4763	broad.mit.edu	37	17	29533304	29533304	+	Nonsense_Mutation	SNP	C	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr17:29533304C>A	uc002hgg.2	+	12	1640	c.1307C>A	c.(1306-1308)TCG>TAG	p.S436*	NF1_uc002hge.1_Nonsense_Mutation_p.S436*|NF1_uc002hgf.1_Nonsense_Mutation_p.S436*|NF1_uc002hgh.2_Nonsense_Mutation_p.S436*|NF1_uc010csn.1_Nonsense_Mutation_p.S296*	NM_001042492	NP_001035957	P21359	NF1_HUMAN	neurofibromin isoform 1	436					actin cytoskeleton organization|adrenal gland development|artery morphogenesis|camera-type eye morphogenesis|cerebral cortex development|collagen fibril organization|forebrain astrocyte development|forebrain morphogenesis|heart development|liver development|MAPKKK cascade|metanephros development|myelination in peripheral nervous system|negative regulation of cell migration|negative regulation of endothelial cell proliferation|negative regulation of MAP kinase activity|negative regulation of MAPKKK cascade|negative regulation of neuroblast proliferation|negative regulation of oligodendrocyte differentiation|negative regulation of transcription factor import into nucleus|osteoblast differentiation|phosphatidylinositol 3-kinase cascade|pigmentation|positive regulation of adenylate cyclase activity|positive regulation of neuron apoptosis|Ras protein signal transduction|regulation of blood vessel endothelial cell migration|regulation of bone resorption|response to hypoxia|smooth muscle tissue development|spinal cord development|sympathetic nervous system development|visual learning|wound healing	axon|cytoplasm|dendrite|intrinsic to internal side of plasma membrane|nucleus	protein binding|Ras GTPase activator activity	p.?(2)		soft_tissue(159)|central_nervous_system(56)|lung(28)|large_intestine(27)|haematopoietic_and_lymphoid_tissue(18)|ovary(18)|autonomic_ganglia(12)|breast(3)|skin(3)|stomach(2)|thyroid(1)|prostate(1)|kidney(1)|pancreas(1)	330		all_cancers(10;1.29e-12)|all_epithelial(10;0.00347)|all_hematologic(16;0.00556)|Acute lymphoblastic leukemia(14;0.00593)|Breast(31;0.014)|Myeloproliferative disorder(56;0.0255)|all_lung(9;0.0321)|Lung NSC(157;0.0659)		UCEC - Uterine corpus endometrioid carcinoma (4;4.38e-05)|all cancers(4;1.64e-26)|Epithelial(4;9.15e-23)|OV - Ovarian serous cystadenocarcinoma(4;3.58e-21)|GBM - Glioblastoma multiforme(4;0.00146)		TATTGTCACTCGGTTGAACTT	0.413			D|Mis|N|F|S|O		neurofibroma|glioma	neurofibroma|glioma			Neurofibromatosis_type_1	TCGA GBM(6;<1E-08)|TSP Lung(7;0.0071)|TCGA Ovarian(3;0.0088)		
21 | CYP4F11	57834	broad.mit.edu	37	19	16034748	16034748	+	Silent	SNP	G	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr19:16034748G>A	uc002nbu.2	-	7	828	c.792C>T	c.(790-792)CAC>CAT	p.H264H	CYP4F11_uc010eab.1_Silent_p.H264H|CYP4F11_uc002nbt.2_Silent_p.H264H	NM_001128932	NP_001122404	Q9HBI6	CP4FB_HUMAN	cytochrome P450 family 4 subfamily F polypeptide	264					inflammatory response|xenobiotic metabolic process	endoplasmic reticulum membrane|integral to membrane|microsome	aromatase activity|electron carrier activity|heme binding			ovary(1)	1						CTGTGAAGTCGTGCACCAGGT	0.527												
22 | USE1	55850	broad.mit.edu	37	19	17329200	17329200	+	Missense_Mutation	SNP	C	T	T			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr19:17329200C>T	uc002nfo.2	+	6	482	c.422C>T	c.(421-423)ACT>ATT	p.T141I	USE1_uc002nfn.2_3'UTR|USE1_uc010eal.1_Missense_Mutation_p.T141I	NM_018467	NP_060937	Q9NZ43	USE1_HUMAN	unconventional SNARE in the ER 1 homolog	141	Cytoplasmic (Potential).				lysosomal transport|protein catabolic process|protein transport|secretion by cell|vesicle-mediated transport	endoplasmic reticulum membrane|integral to membrane	protein binding				0						AGGAAGAGAACGTGAGTGTCT	0.582												
23 | PSG1	5669	broad.mit.edu	37	19	43382389	43382389	+	Missense_Mutation	SNP	C	T	T			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr19:43382389C>T	uc002ovb.2	-	2	244	c.106G>A	c.(106-108)GTC>ATC	p.V36I	PSG3_uc002ouf.2_Intron|PSG1_uc002oug.1_Missense_Mutation_p.V36I|PSG11_uc002ouw.2_Intron|PSG7_uc002ous.1_Intron|PSG7_uc002out.1_Intron|PSG10_uc002ouv.1_Intron|PSG1_uc002oun.2_RNA|PSG1_uc002our.1_Missense_Mutation_p.V36I|PSG1_uc010eio.1_Missense_Mutation_p.V36I|PSG1_uc002oux.1_5'UTR|PSG1_uc002ouy.1_Missense_Mutation_p.V36I|PSG1_uc002ouz.1_Missense_Mutation_p.V36I|PSG1_uc002ova.1_Missense_Mutation_p.V36I|PSG1_uc002ovc.2_Missense_Mutation_p.V36I|PSG1_uc002ovd.1_Missense_Mutation_p.V36I	NM_006905	NP_008836	P11464	PSG1_HUMAN	pregnancy specific beta-1-glycoprotein 1	36	Ig-like V-type.				female pregnancy	extracellular region				ovary(2)	2		Prostate(69;0.00682)				TCAATCGTGACTTGGGCAGTG	0.463												
24 | CACNG6	59285	broad.mit.edu	37	19	54503003	54503003	+	Silent	SNP	A	G	G			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr19:54503003A>G	uc002qct.2	+	3	1112	c.522A>G	c.(520-522)GGA>GGG	p.G174G	CACNG6_uc002qcu.2_Intron|CACNG6_uc002qcv.2_Intron	NM_145814	NP_665813	Q9BXT2	CCG6_HUMAN	voltage-dependent calcium channel gamma-6	174	Helical; (Potential).					voltage-gated calcium channel complex	voltage-gated calcium channel activity			ovary(2)	2	all_cancers(19;0.0128)|all_epithelial(19;0.00564)|all_lung(19;0.031)|Lung NSC(19;0.0358)|Ovarian(34;0.19)			GBM - Glioblastoma multiforme(134;0.168)		TCCGAGTTGGAGCCGTCTGCT	0.587												
25 | LY75	4065	broad.mit.edu	37	2	160755280	160755280	+	Missense_Mutation	SNP	G	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr2:160755280G>A	uc002ubc.3	-	2	454	c.385C>T	c.(385-387)CAT>TAT	p.H129Y	LY75_uc002ubb.3_Missense_Mutation_p.H129Y|LY75_uc010fos.2_Missense_Mutation_p.H129Y|LY75_uc010fot.1_Missense_Mutation_p.H129Y	NM_002349	NP_002340	O60449	LY75_HUMAN	lymphocyte antigen 75 precursor	129	Extracellular (Potential).|Ricin B-type lectin.				endocytosis|immune response|inflammatory response	integral to plasma membrane	receptor activity|sugar binding				0				COAD - Colon adenocarcinoma(177;0.132)		GCTGTGCCATGTCCATCCTTC	0.522												
26 | SYN3	8224	broad.mit.edu	37	22	32937634	32937634	+	Silent	SNP	G	A	A	rs148217218		TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr22:32937634G>A	uc003amx.2	-	7	999	c.840C>T	c.(838-840)TAC>TAT	p.Y280Y	SYN3_uc003amy.2_Silent_p.Y280Y|SYN3_uc003amz.2_Silent_p.Y279Y	NM_003490	NP_003481	O14994	SYN3_HUMAN	synapsin III isoform IIIa	280	C; actin-binding and synaptic-vesicle binding.				neurotransmitter secretion	cell junction|synaptic vesicle membrane	ATP binding|ligase activity			skin(1)	1						CGGTGGTGGCGTAGGTTTTGG	0.552												
27 | SI	6476	broad.mit.edu	37	3	164786544	164786544	+	Missense_Mutation	SNP	G	T	T			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr3:164786544G>T	uc003fei.2	-	5	511	c.449C>A	c.(448-450)ACT>AAT	p.T150N		NM_001041	NP_001032	P14410	SUIS_HUMAN	sucrase-isomaltase	150	Lumenal.|Isomaltase.				carbohydrate metabolic process|polysaccharide digestion	apical plasma membrane|brush border|Golgi apparatus|integral to membrane	carbohydrate binding|oligo-1,6-glucosidase activity|sucrose alpha-glucosidase activity			ovary(7)|upper_aerodigestive_tract(4)|skin(2)|pancreas(1)	14		Prostate(884;0.00314)|Melanoma(1037;0.0153)|all_neural(597;0.0199)			Acarbose(DB00284)	CTGATTTTGAGTTGTGAAGAG	0.323										HNSCC(35;0.089)		
28 | PYDC2	152138	broad.mit.edu	37	3	191179074	191179074	+	Silent	SNP	C	T	T	rs141891926	by1000genomes	TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr3:191179074C>T	uc011bso.1	+	1	123	c.123C>T	c.(121-123)ACC>ACT	p.T41T		NM_001083308	NP_001076777	Q56P42	PYDC2_HUMAN	pyrin domain containing 2	41	DAPIN.					cytoplasm|nucleus					0						AGCTACAGACCGTCCCCCAGA	0.542												
29 | KLHL5	51088	broad.mit.edu	37	4	39116788	39116788	+	Silent	SNP	C	G	G			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr4:39116788C>G	uc003gts.2	+	10	2124	c.2049C>G	c.(2047-2049)CCC>CCG	p.P683P	KLHL5_uc003gtp.2_Silent_p.P637P|KLHL5_uc003gtq.2_Silent_p.P496P|KLHL5_uc003gtr.1_Silent_p.P683P|KLHL5_uc003gtt.2_Silent_p.P622P	NM_015990	NP_057074	Q96PQ7	KLHL5_HUMAN	kelch-like 5 isoform 1	683	Kelch 5.					cytoplasm|cytoskeleton	actin binding			ovary(1)	1						GATATGATCCCAAAACAGACA	0.383												
30 | GPRIN3	285513	broad.mit.edu	37	4	90170302	90170302	+	Silent	SNP	C	T	T	rs145721148	byFrequency	TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr4:90170302C>T	uc003hsm.1	-	2	1479	c.960G>A	c.(958-960)GCG>GCA	p.A320A		NM_198281	NP_938022	Q6ZVF9	GRIN3_HUMAN	G protein-regulated inducer of neurite outgrowth	320										ovary(3)	3		Hepatocellular(203;0.114)		OV - Ovarian serous cystadenocarcinoma(123;5.67e-05)		CCTGCACCTCCGCATCTTGCC	0.537												
31 | HEATR7B2	133558	broad.mit.edu	37	5	41048449	41048449	+	Missense_Mutation	SNP	G	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr5:41048449G>A	uc003jmj.3	-	16	2151	c.1661C>T	c.(1660-1662)CCT>CTT	p.P554L	HEATR7B2_uc003jmi.3_Missense_Mutation_p.P109L	NM_173489	NP_775760	Q7Z745	HTRB2_HUMAN	HEAT repeat family member 7B2	554	HEAT 6.						binding			ovary(6)|central_nervous_system(2)	8						CAGAAGCTCAGGTAAACGTGT	0.468												
32 | KCTD16	57528	broad.mit.edu	37	5	143853547	143853547	+	Missense_Mutation	SNP	A	C	C			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr5:143853547A>C	uc003lnm.1	+	4	1786	c.1157A>C	c.(1156-1158)AAA>ACA	p.K386T	KCTD16_uc003lnn.1_Missense_Mutation_p.K386T	NM_020768	NP_065819	Q68DU8	KCD16_HUMAN	potassium channel tetramerisation domain	386						cell junction|postsynaptic membrane|presynaptic membrane|voltage-gated potassium channel complex	voltage-gated potassium channel activity			large_intestine(2)|ovary(1)|skin(1)	4		all_hematologic(541;0.118)	KIRC - Kidney renal clear cell carcinoma(527;0.00111)|Kidney(363;0.00176)			AAAGCTGTTAAAGAAAAGCTC	0.443												
33 | UNC5A	90249	broad.mit.edu	37	5	176301527	176301527	+	Silent	SNP	C	T	T			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr5:176301527C>T	uc003mey.2	+	8	1530	c.1338C>T	c.(1336-1338)ACC>ACT	p.T446T	UNC5A_uc010jkg.1_Silent_p.T406T	NM_133369	NP_588610	Q6ZN44	UNC5A_HUMAN	netrin receptor Unc5h1 precursor	446	ZU5.|Cytoplasmic (Potential).				apoptosis|axon guidance|regulation of apoptosis	integral to membrane|plasma membrane				skin(1)	1	all_cancers(89;0.000119)|Renal(175;0.000269)|Lung NSC(126;0.00696)|all_lung(126;0.0115)	Medulloblastoma(196;0.00498)|all_neural(177;0.0138)	Kidney(164;2.23e-05)|KIRC - Kidney renal clear cell carcinoma(164;0.000178)			CCTATGGGACCTTCAACTTCC	0.627												
34 | GRM3	2913	broad.mit.edu	37	7	86469103	86469103	+	Missense_Mutation	SNP	C	T	T	rs141671463		TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr7:86469103C>T	uc003uid.2	+	4	3372	c.2273C>T	c.(2272-2274)ACG>ATG	p.T758M	GRM3_uc010lef.2_Intron|GRM3_uc010leg.2_Missense_Mutation_p.T630M|GRM3_uc010leh.2_Missense_Mutation_p.T350M	NM_000840	NP_000831	Q14832	GRM3_HUMAN	glutamate receptor, metabotropic 3 precursor	758	Cytoplasmic (Potential).				synaptic transmission	integral to plasma membrane		p.T758M(1)		lung(4)|ovary(3)|central_nervous_system(2)|skin(2)|haematopoietic_and_lymphoid_tissue(1)|prostate(1)	13	Esophageal squamous(14;0.0058)|all_lung(186;0.132)|Lung NSC(181;0.142)				Acamprosate(DB00659)|Nicotine(DB00184)	GCCTTCAAAACGCGGAAGTGC	0.428												
35 | CYP3A5	1577	broad.mit.edu	37	7	99262902	99262902	+	Missense_Mutation	SNP	C	G	G			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr7:99262902C>G	uc003urq.2	-	7	644	c.557G>C	c.(556-558)GGC>GCC	p.G186A	ZNF498_uc003urn.2_Intron|CYP3A5_uc003urp.2_Missense_Mutation_p.G6A|CYP3A5_uc003urr.2_Missense_Mutation_p.G73A|CYP3A5_uc011kiy.1_Missense_Mutation_p.G176A|CYP3A5_uc003urs.2_Intron|CYP3A5_uc010lgg.2_Intron	NM_000777	NP_000768	P20815	CP3A5_HUMAN	cytochrome P450, family 3, subfamily A,	186					alkaloid catabolic process|drug catabolic process|oxidative demethylation|steroid metabolic process|xenobiotic metabolic process	endoplasmic reticulum membrane|microsome	aromatase activity|electron carrier activity|heme binding|oxygen binding				0	all_epithelial(64;2.77e-08)|Lung NSC(181;0.00396)|all_lung(186;0.00659)|Esophageal squamous(72;0.0166)				Alfentanil(DB00802)|Clopidogrel(DB00758)|Cyclosporine(DB00091)|Daunorubicin(DB00694)|Indinavir(DB00224)|Irinotecan(DB00762)|Ketoconazole(DB01026)|Lapatinib(DB01259)|Mephenytoin(DB00532)|Midazolam(DB00683)|Mifepristone(DB00834)|Phenytoin(DB00252)|Quinine(DB00468)|Saquinavir(DB01232)|Tacrolimus(DB00864)|Troleandomycin(DB01361)|Verapamil(DB00661)|Vincristine(DB00541)	AAATGATGTGCCAGTAATCAC	0.418												
36 | PIP	5304	broad.mit.edu	37	7	142836647	142836647	+	Missense_Mutation	SNP	G	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr7:142836647G>A	uc003wcf.1	+	4	389	c.353G>A	c.(352-354)CGG>CAG	p.R118Q		NM_002652	NP_002643	P12273	PIP_HUMAN	prolactin-induced protein precursor	118						extracellular region	actin binding			ovary(1)	1	Melanoma(164;0.059)	Ovarian(593;2.82e-05)|Breast(660;0.012)		BRCA - Breast invasive adenocarcinoma(188;0.0026)|LUSC - Lung squamous cell carcinoma(290;0.0733)|Lung(243;0.08)		GATGTTATTCGGGAATTAGGC	0.453												
37 | DMRT3	58524	broad.mit.edu	37	9	990484	990484	+	Missense_Mutation	SNP	G	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chr9:990484G>A	uc003zgw.1	+	2	936	c.898G>A	c.(898-900)GCA>ACA	p.A300T		NM_021240	NP_067063	Q9NQL9	DMRT3_HUMAN	doublesex and mab-3 related transcription factor	300					cell differentiation|multicellular organismal development|sex differentiation	nucleus	DNA binding|metal ion binding|sequence-specific DNA binding transcription factor activity			ovary(2)|central_nervous_system(1)	3		all_lung(10;1.39e-08)|Lung NSC(10;1.42e-08)		Lung(218;0.0196)		GCGAACTTCCGCAGAACCTGA	0.582												
38 | ZBED1	9189	broad.mit.edu	37	X	2407462	2407462	+	Silent	SNP	C	T	T			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chrX:2407462C>T	uc004cqg.2	-	2	1500	c.1299G>A	c.(1297-1299)ACG>ACA	p.T433T	DHRSX_uc004cqf.3_Intron|ZBED1_uc004cqh.1_Silent_p.T433T	NM_004729	NP_004720	O96006	ZBED1_HUMAN	zinc finger, BED-type containing 1	433						nuclear chromosome	DNA binding|metal ion binding|protein dimerization activity|transposase activity				0		all_cancers(21;4.28e-07)|all_epithelial(21;2.07e-08)|all_lung(23;2.81e-05)|Lung NSC(23;0.000693)|Lung SC(21;0.122)				TGATGTTGAGCGTGGTGTTCA	0.597												
39 | RAI2	10742	broad.mit.edu	37	X	17818684	17818684	+	Missense_Mutation	SNP	G	C	C			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chrX:17818684G>C	uc004cyf.2	-	3	2017	c.1447C>G	c.(1447-1449)CAA>GAA	p.Q483E	RAI2_uc004cyg.2_Missense_Mutation_p.Q483E|RAI2_uc010nfa.2_Missense_Mutation_p.Q483E|RAI2_uc004cyh.3_Missense_Mutation_p.Q483E|RAI2_uc011miy.1_Missense_Mutation_p.Q433E	NM_021785	NP_068557	Q9Y5P3	RAI2_HUMAN	retinoic acid induced 2	483					embryo development					ovary(1)|breast(1)	2	Hepatocellular(33;0.183)					TCTTCCCCTTGGCTGTTGATG	0.468												
40 | EIF2S3	1968	broad.mit.edu	37	X	24073154	24073154	+	Silent	SNP	G	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chrX:24073154G>A	uc004dbc.2	+	1	90	c.69G>A	c.(67-69)TTG>TTA	p.L23L		NM_001415	NP_001406	P41091	IF2G_HUMAN	eukaryotic translation initiation factor 2,	23						cytosol	GTP binding|GTPase activity|protein binding|translation initiation factor activity			lung(1)	1						TCACCACCTTGGTGAGGTTTT	0.587											OREG0019714	type=REGULATORY REGION|TFbs=CTCF|Dataset=CTCF ChIP-chip sites (Ren lab)|EvidenceSubtype=ChIP-on-chip (ChIP-chip)
41 | PORCN	64840	broad.mit.edu	37	X	48368320	48368320	+	Missense_Mutation	SNP	G	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chrX:48368320G>A	uc010nie.1	+	2	270	c.112G>A	c.(112-114)GCC>ACC	p.A38T	PORCN_uc004djq.1_Missense_Mutation_p.A151T|PORCN_uc004djr.1_Missense_Mutation_p.A38T|PORCN_uc004djs.1_Missense_Mutation_p.A38T|PORCN_uc004djt.1_5'UTR|PORCN_uc011mlx.1_5'UTR|PORCN_uc004dju.1_5'UTR|PORCN_uc004djv.1_Missense_Mutation_p.A38T|PORCN_uc004djw.1_Missense_Mutation_p.A38T	NM_203475	NP_982301	Q9H237	PORCN_HUMAN	porcupine isoform D	38	Helical; (Potential).|Leu-rich.				Wnt receptor signaling pathway	endoplasmic reticulum membrane|integral to membrane	acyltransferase activity			ovary(2)|central_nervous_system(1)	3						CATCTGCCTCGCCTGCCGCCT	0.413												
42 | WNK3	65267	broad.mit.edu	37	X	54276526	54276526	+	Nonsense_Mutation	SNP	G	A	A			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chrX:54276526G>A	uc004dtd.1	-	16	3053	c.2614C>T	c.(2614-2616)CGA>TGA	p.R872*	WNK3_uc004dtc.1_Nonsense_Mutation_p.R872*	NM_001002838	NP_001002838	Q9BYP7	WNK3_HUMAN	WNK lysine deficient protein kinase 3 isoform 2	872					intracellular protein kinase cascade|positive regulation of establishment of protein localization in plasma membrane|positive regulation of peptidyl-threonine phosphorylation|positive regulation of rubidium ion transmembrane transporter activity|positive regulation of rubidium ion transport|positive regulation of sodium ion transmembrane transporter activity|positive regulation of sodium ion transport|protein autophosphorylation	adherens junction|tight junction	ATP binding|protein binding|protein serine/threonine kinase activity|rubidium ion transmembrane transporter activity|sodium ion transmembrane transporter activity			lung(4)|ovary(3)|kidney(2)|central_nervous_system(2)	11						ATACAGAATCGCCACCGACCA	0.423												
43 | IL1RAPL2	26280	broad.mit.edu	37	X	105011568	105011568	+	Silent	SNP	C	T	T			TCGA-06-5410-01A-01D-1696-08	TCGA-06-5410-10A-01D-1696-08									Somatic	Phase_I	Capture				Illumina GAIIx	67244284-dc40-46cb-a2ac-3f4a38f7bbe4	2df41e20-041f-4e1e-86d9-3c38e36c9b33	g.chrX:105011568C>T	uc004elz.1	+	11	2731	c.1975C>T	c.(1975-1977)CTG>TTG	p.L659L		NM_017416	NP_059112	Q9NP60	IRPL2_HUMAN	interleukin 1 receptor accessory protein-like 2	659	Cytoplasmic (Potential).				central nervous system development|innate immune response	integral to membrane	interleukin-1, Type II, blocking receptor activity			breast(2)|ovary(1)	3						TAATAACACCCTGAAAGATAC	0.448												
44 | 


--------------------------------------------------------------------------------
/HW6/data/TCGA-28-5218.maf.txt:
--------------------------------------------------------------------------------
 1 | Hugo_Symbol	Entrez_Gene_Id	Center	NCBI_Build	Chromosome	Start_position	End_position	Strand	Variant_Classification	Variant_Type	Reference_Allele	Tumor_Seq_Allele1	Tumor_Seq_Allele2	dbSNP_RS	dbSNP_Val_Status	Tumor_Sample_Barcode	Matched_Norm_Sample_Barcode	Match_Norm_Seq_Allele1	Match_Norm_Seq_Allele2	Tumor_Validation_Allele1	Tumor_Validation_Allele2	Match_Norm_Validation_Allele1	Match_Norm_Validation_Allele2	Verification_Status	Validation_Status	Mutation_Status	Sequencing_Phase	Sequence_Source	Validation_Method	Score	BAM_file	Sequencer	Tumor_Sample_UUID	Matched_Norm_Sample_UUID	Genome_Change	Annotation_Transcript	Transcript_Strand	Transcript_Exon	Transcript_Position	cDNA_Change	Codon_Change	Protein_Change	Other_Transcripts	Refseq_mRNA_Id	Refseq_prot_Id	SwissProt_acc_Id	SwissProt_entry_Id	Description	UniProt_AApos	UniProt_Region	UniProt_Site	UniProt_Natural_Variations	UniProt_Experimental_Info	GO_Biological_Process	GO_Cellular_Component	GO_Molecular_Function	COSMIC_overlapping_mutations	COSMIC_fusion_genes	COSMIC_tissue_types_affected	COSMIC_total_alterations_in_gene	Tumorscape_Amplification_Peaks	Tumorscape_Deletion_Peaks	TCGAscape_Amplification_Peaks	TCGAscape_Deletion_Peaks	DrugBank	ref_context	gc_content	CCLE_ONCOMAP_overlapping_mutations	CCLE_ONCOMAP_total_mutations_in_gene	CGC_Mutation_Type	CGC_Translocation_Partner	CGC_Tumor_Types_Somatic	CGC_Tumor_Types_Germline	CGC_Other_Diseases	DNARepairGenes_Role	FamilialCancerDatabase_Syndromes	MUTSIG_Published_Results	OREGANNO_ID	OREGANNO_Values
 2 | ZBTB40	9923	broad.mit.edu	37	1	22835047	22835047	+	Missense_Mutation	SNP	G	T	T			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr1:22835047G>T	uc001bft.2	+	9	2033	c.1522G>T	c.(1522-1524)GAC>TAC	p.D508Y	ZBTB40_uc001bfu.2_Missense_Mutation_p.D508Y|ZBTB40_uc009vqi.1_Missense_Mutation_p.D396Y|ZBTB40_uc001bfv.1_Missense_Mutation_p.D137Y	NM_001083621	NP_001077090	Q9NUA8	ZBT40_HUMAN	zinc finger and BTB domain containing 40	508					bone mineralization|regulation of transcription, DNA-dependent|response to DNA damage stimulus|transcription, DNA-dependent	nucleus	DNA binding|zinc ion binding			ovary(1)	1		Colorectal(325;3.46e-05)|Lung NSC(340;6.55e-05)|all_lung(284;9.87e-05)|Renal(390;0.000219)|Breast(348;0.00222)|Ovarian(437;0.00308)|Myeloproliferative disorder(586;0.0255)		UCEC - Uterine corpus endometrioid carcinoma (279;0.0228)|OV - Ovarian serous cystadenocarcinoma(117;2.86e-26)|Colorectal(126;8.55e-08)|COAD - Colon adenocarcinoma(152;4.1e-06)|GBM - Glioblastoma multiforme(114;1.39e-05)|BRCA - Breast invasive adenocarcinoma(304;0.000712)|KIRC - Kidney renal clear cell carcinoma(1967;0.00374)|STAD - Stomach adenocarcinoma(196;0.00645)|READ - Rectum adenocarcinoma(331;0.0693)|Lung(427;0.216)		TGTGAAACGTGACTCTGGTTC	0.483												
 3 | HMCN1	83872	broad.mit.edu	37	1	186121993	186121993	+	Missense_Mutation	SNP	T	G	G			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr1:186121993T>G	uc001grq.1	+	96	15237	c.15008T>G	c.(15007-15009)GTC>GGC	p.V5003G	HMCN1_uc001grs.1_Missense_Mutation_p.V572G	NM_031935	NP_114141	Q96RW7	HMCN1_HUMAN	hemicentin 1 precursor	5003	Nidogen G2 beta-barrel.				response to stimulus|visual perception	basement membrane	calcium ion binding			ovary(22)|skin(1)	23						CCTGCTGAAGTCACTGTAAAG	0.438												
 4 | OBSCN	84033	broad.mit.edu	37	1	228559651	228559651	+	Missense_Mutation	SNP	C	T	T			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr1:228559651C>T	uc009xez.1	+	94	21216	c.21172C>T	c.(21172-21174)CCT>TCT	p.P7058S	OBSCN_uc001hsr.1_Missense_Mutation_p.P1687S	NM_001098623	NP_001092093	Q5VST9	OBSCN_HUMAN	obscurin, cytoskeletal calmodulin and	7058	Pro-rich.				apoptosis|cell differentiation|induction of apoptosis by extracellular signals|multicellular organismal development|nerve growth factor receptor signaling pathway|regulation of Rho protein signal transduction|small GTPase mediated signal transduction	cytosol|M band|Z disc	ATP binding|metal ion binding|protein binding|protein serine/threonine kinase activity|protein tyrosine kinase activity|Rho guanyl-nucleotide exchange factor activity|structural constituent of muscle|titin binding			stomach(8)|large_intestine(7)|breast(5)|ovary(4)|skin(2)|central_nervous_system(1)|pancreas(1)	28		Prostate(94;0.0405)				CCCATGCCCTCCTGGCTCCTT	0.672												
 5 | KIAA1804	84451	broad.mit.edu	37	1	233518426	233518426	+	Missense_Mutation	SNP	T	C	C			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr1:233518426T>C	uc001hvt.3	+	10	3341	c.3080T>C	c.(3079-3081)ATA>ACA	p.I1027T	KIAA1804_uc001hvu.3_Missense_Mutation_p.I473T	NM_032435	NP_115811	Q5TCX8	M3KL4_HUMAN	mixed lineage kinase 4	1027					activation of JUN kinase activity|protein autophosphorylation		ATP binding|MAP kinase kinase kinase activity|protein homodimerization activity			lung(5)|central_nervous_system(2)|skin(1)	8		all_cancers(173;0.000405)|all_epithelial(177;0.0345)|Prostate(94;0.122)				CGGCCATCTATATATGAACTG	0.428												
 6 | HSD17B7P2	158160	broad.mit.edu	37	10	38654432	38654432	+	Missense_Mutation	SNP	A	G	G	rs2257765		TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr10:38654432A>G	uc010qex.1	+	5	599	c.524A>G	c.(523-525)AAT>AGT	p.N175S	HSD17B7P2_uc001izq.2_RNA|HSD17B7P2_uc001izo.1_RNA|HSD17B7P2_uc001izp.1_Missense_Mutation_p.N173S					SubName: Full=cDNA FLJ60462, highly similar to 3-keto-steroid reductase (EC 1.1.1.270);												0						TCATCTCGCAATGCAAGGAAA	0.453												
 7 | PTPRE	5791	broad.mit.edu	37	10	129861345	129861345	+	Splice_Site	SNP	A	T	T			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr10:129861345A>T	uc001lkb.2	+	10	905	c.626_splice	c.e10-2	p.G209_splice	PTPRE_uc009yat.2_Splice_Site_p.G220_splice|PTPRE_uc010qup.1_Splice_Site|PTPRE_uc009yau.2_Splice_Site_p.G209_splice|PTPRE_uc001lkd.2_Splice_Site_p.G151_splice|PTPRE_uc010quq.1_Splice_Site_p.G110_splice	NM_006504	NP_006495	P23469	PTPRE_HUMAN	protein tyrosine phosphatase, receptor type, E						negative regulation of insulin receptor signaling pathway|protein phosphorylation	cytoplasm|integral to membrane|intermediate filament cytoskeleton|nucleus|plasma membrane	transmembrane receptor protein tyrosine phosphatase activity			ovary(1)	1		all_epithelial(44;1.66e-05)|all_lung(145;0.00456)|Lung NSC(174;0.0066)|all_neural(114;0.0936)|Colorectal(57;0.141)|Breast(234;0.166)|Melanoma(40;0.203)				CTCTACACACAGGTCCCAAAC	0.522												
 8 | MEN1	4221	broad.mit.edu	37	11	64575521	64575521	+	Nonsense_Mutation	SNP	G	A	A			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr11:64575521G>A	uc001obj.2	-	3	584	c.511C>T	c.(511-513)CAG>TAG	p.Q171*	MEN1_uc001obk.2_Nonsense_Mutation_p.Q171*|MEN1_uc001obl.2_Nonsense_Mutation_p.Q166*|MEN1_uc001obm.2_Nonsense_Mutation_p.Q166*|MEN1_uc001obn.2_Nonsense_Mutation_p.Q171*|MEN1_uc001obo.2_Nonsense_Mutation_p.Q171*|MEN1_uc001obp.2_Nonsense_Mutation_p.Q166*|MEN1_uc001obq.2_Nonsense_Mutation_p.Q171*|MEN1_uc001obr.2_Nonsense_Mutation_p.Q171*	NM_130800	NP_570712	O00255	MEN1_HUMAN	menin isoform 1	171			Missing (in MEN1).		DNA repair|histone lysine methylation|MAPKKK cascade|negative regulation of cell proliferation|negative regulation of cyclin-dependent protein kinase activity|negative regulation of JNK cascade|negative regulation of osteoblast differentiation|negative regulation of protein phosphorylation|negative regulation of sequence-specific DNA binding transcription factor activity|negative regulation of telomerase activity|negative regulation of transcription from RNA polymerase II promoter|osteoblast development|positive regulation of protein binding|positive regulation of transforming growth factor beta receptor signaling pathway|response to gamma radiation|response to UV|transcription, DNA-dependent	chromatin|cleavage furrow|cytosol|histone methyltransferase complex|nuclear matrix|soluble fraction	double-stranded DNA binding|four-way junction DNA binding|protein binding, bridging|protein N-terminus binding|R-SMAD binding|transcription regulatory region DNA binding|Y-form DNA binding	p.R171Q(1)		parathyroid(105)|pancreas(64)|gastrointestinal_tract_(site_indeterminate)(15)|small_intestine(13)|lung(9)|pituitary(7)|NS(7)|adrenal_gland(5)|soft_tissue(4)|central_nervous_system(4)|thymus(2)|stomach(1)|retroperitoneum(1)|skin(1)	238						CCCAGGGCCTGGCAGGCCCCA	0.602			D|Mis|N|F|S		parathyroid tumors|Pancreatic neuroendocrine tumors	parathyroid adenoma|pituitary adenoma|pancreatic islet cell|carcinoid			Hyperparathyroidism_Familial_Isolated|Multiple_Endocrine_Neoplasia_type_1			
 9 | KRTAP5-11	440051	broad.mit.edu	37	11	71293418	71293418	+	Missense_Mutation	SNP	T	G	G			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr11:71293418T>G	uc001oqu.2	-	1	504	c.466A>C	c.(466-468)ATC>CTC	p.I156L		NM_001005405	NP_001005405	Q6L8G4	KR511_HUMAN	keratin associated protein 5-11	156						keratin filament					0						GAGCCTCAGATCTTACACTGG	0.308												
10 | INPPL1	3636	broad.mit.edu	37	11	71942586	71942586	+	Frame_Shift_Del	DEL	C	-	-			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr11:71942586delC	uc001osf.2	+	13	1689	c.1542delC	c.(1540-1542)GTCfs	p.V514fs	INPPL1_uc001osg.2_Frame_Shift_Del_p.V272fs	NM_001567	NP_001558	O15357	SHIP2_HUMAN	inositol polyphosphate phosphatase-like 1	514					actin filament organization|cell adhesion|endocytosis	actin cortical patch|cytosol	actin binding|SH2 domain binding|SH3 domain binding			skin(2)|ovary(1)|breast(1)	4						CAGTGCTGGTCAAGCCAGAGC	0.567												
11 | RAB30	27314	broad.mit.edu	37	11	82693315	82693315	+	Silent	SNP	G	A	A			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr11:82693315G>A	uc001ozu.2	-	6	765	c.504C>T	c.(502-504)TGC>TGT	p.C168C	RAB30_uc009yve.2_Silent_p.C166C|RAB30_uc010rst.1_Silent_p.C166C|RAB30_uc001ozv.2_3'UTR	NM_014488	NP_055303	Q15771	RAB30_HUMAN	RAB30, member RAS oncogene family	168					protein transport|small GTPase mediated signal transduction	Golgi stack|plasma membrane	GTP binding|GTPase activity				0						TGATGAGTCGGCATGCTAAGT	0.438												
12 | SESN3	143686	broad.mit.edu	37	11	94924753	94924756	+	Frame_Shift_Del	DEL	TTGC	-	-			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr11:94924753_94924756delTTGC	uc001pfk.1	-	3	376_379	c.154_157delGCAA	c.(154-159)GCAAACfs	p.A52fs	SESN3_uc010rug.1_5'UTR|SESN3_uc001pfl.2_Frame_Shift_Del_p.A52fs	NM_144665	NP_653266	P58005	SESN3_HUMAN	sestrin 3	52_53					cell cycle arrest	nucleus					0		Acute lymphoblastic leukemia(157;2.26e-05)|all_hematologic(158;0.0123)		BRCA - Breast invasive adenocarcinoma(274;0.234)		TCCACTGTGTTTGCTTGGACAACC	0.368												
13 | HELB	92797	broad.mit.edu	37	12	66698566	66698566	+	Silent	SNP	G	A	A			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr12:66698566G>A	uc001sti.2	+	2	271	c.243G>A	c.(241-243)CCG>CCA	p.P81P	HELB_uc010ssz.1_RNA|HELB_uc009zqt.1_RNA	NM_033647	NP_387467	Q8NG08	HELB_HUMAN	helicase (DNA) B	81					DNA replication, synthesis of RNA primer		ATP binding|ATP-dependent 5'-3' DNA helicase activity|single-stranded DNA-dependent ATP-dependent DNA helicase activity			central_nervous_system(1)|pancreas(1)	2			GBM - Glioblastoma multiforme(2;0.000142)	GBM - Glioblastoma multiforme(28;0.0265)		GACGTTTTCCGATAACAGGTG	0.378												
14 | CABP1	9478	broad.mit.edu	37	12	121098105	121098105	+	Missense_Mutation	SNP	G	A	A			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr12:121098105G>A	uc001tyu.2	+	3	859	c.792G>A	c.(790-792)ATG>ATA	p.M264I	CABP1_uc001tyv.2_Missense_Mutation_p.M121I|CABP1_uc001tyw.2_Missense_Mutation_p.M61I|CABP1_uc001tyx.2_Missense_Mutation_p.M106I	NM_001033677	NP_001028849	Q9NZU7	CABP1_HUMAN	calcium binding protein 1 isoform 3	264	EF-hand 2.					cell cortex|cell junction|Golgi apparatus|perinuclear region of cytoplasm|postsynaptic density|postsynaptic membrane	calcium ion binding|calcium-dependent protein binding|enzyme inhibitor activity|protein binding			central_nervous_system(1)	1	all_neural(191;0.0684)|Medulloblastoma(191;0.0922)					CCACCGAGATGGAGCTCATCG	0.542												
15 | HERC2	8924	broad.mit.edu	37	15	28389261	28389261	+	Missense_Mutation	SNP	G	A	A			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr15:28389261G>A	uc001zbj.2	-	73	11367	c.11261C>T	c.(11260-11262)GCG>GTG	p.A3754V		NM_004667	NP_004658	O95714	HERC2_HUMAN	hect domain and RLD 2	3754					DNA repair|intracellular protein transport|protein ubiquitination involved in ubiquitin-dependent protein catabolic process	nucleus	guanyl-nucleotide exchange factor activity|heme binding|protein binding|ubiquitin-protein ligase activity|zinc ion binding			ovary(4)|lung(4)|skin(3)|upper_aerodigestive_tract(1)|central_nervous_system(1)	13		all_lung(180;1.3e-11)|Breast(32;0.000194)|Colorectal(260;0.227)		all cancers(64;3.93e-09)|Epithelial(43;9.99e-08)|BRCA - Breast invasive adenocarcinoma(123;0.0271)|GBM - Glioblastoma multiforme(186;0.0497)|Lung(196;0.199)		CAGCGAGGCCGCAAGGCGAGG	0.537												
16 | WASH3P	374666	broad.mit.edu	37	15	102515344	102515344	+	Missense_Mutation	SNP	A	C	C	rs141089280	by1000genomes	TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr15:102515344A>C	uc002cdi.2	+	9	1988	c.568A>C	c.(568-570)AAG>CAG	p.K190Q	WASH3P_uc002cdl.2_Missense_Mutation_p.K190Q|WASH3P_uc002cdk.2_RNA|WASH3P_uc002cdp.2_Missense_Mutation_p.K190Q|WASH3P_uc010bpo.2_RNA|WASH3P_uc002cdq.2_RNA|WASH3P_uc002cdr.2_RNA	NR_003659				RecName: Full=WAS protein family homolog 2; AltName: Full=Protein FAM39B; AltName: Full=CXYorf1-like protein on chromosome 2;												0						GCTGGAGAAGAAGCAGCAGAA	0.662												
17 | MRPS34	65993	broad.mit.edu	37	16	1823074	1823075	+	Frame_Shift_Ins	INS	-	G	G			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr16:1823074_1823075insG	uc002cmo.2	-	1	66_67	c.46_47insC	c.(46-48)CGCfs	p.R16fs	NME3_uc002cmm.2_5'Flank|NME3_uc010brv.2_5'Flank|MRPS34_uc002cmn.2_5'Flank|MRPS34_uc002cmp.1_Frame_Shift_Ins_p.R16fs|EME2_uc002cmq.1_5'Flank|EME2_uc010brw.1_5'Flank	NM_023936	NP_076425	P82930	RT34_HUMAN	mitochondrial ribosomal protein S34	16						mitochondrion|ribosome	protein binding			skin(2)	2						GCGCACGCGGCGGGCCAGCTCC	0.723												
18 | RNF40	9810	broad.mit.edu	37	16	30774843	30774843	+	Silent	SNP	G	A	A			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr16:30774843G>A	uc002dzq.2	+	4	528	c.405G>A	c.(403-405)GGG>GGA	p.G135G	C16orf93_uc002dzm.2_5'Flank|C16orf93_uc002dzn.2_5'Flank|C16orf93_uc002dzo.2_5'Flank|C16orf93_uc002dzp.2_5'Flank|RNF40_uc010caa.2_Silent_p.G135G|RNF40_uc010cab.2_Silent_p.G135G|RNF40_uc010vfa.1_Intron|RNF40_uc002dzr.2_Silent_p.G135G|RNF40_uc010vfb.1_Intron	NM_014771	NP_055586	O75150	BRE1B_HUMAN	ring finger protein 40	135					histone H2B ubiquitination|histone monoubiquitination|ubiquitin-dependent protein catabolic process	nucleus|synaptosome|ubiquitin ligase complex	protein homodimerization activity|ubiquitin protein ligase binding|zinc ion binding			central_nervous_system(1)	1			Colorectal(24;0.198)			CATGTGATGGGACTCCTCTCC	0.612												
19 | KRTAP1-1	81851	broad.mit.edu	37	17	39197186	39197186	+	Missense_Mutation	SNP	C	T	T			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr17:39197186C>T	uc002hvw.1	-	1	528	c.464G>A	c.(463-465)CGC>CAC	p.R155H		NM_030967	NP_112229	Q07627	KRA11_HUMAN	keratin associated protein 1-1	155						extracellular region|keratin filament					0		Breast(137;0.000496)	STAD - Stomach adenocarcinoma(17;0.000371)			GTAGGATGGGCGGCAGCAGGA	0.637												
20 | C19orf10	56005	broad.mit.edu	37	19	4668644	4668644	+	Missense_Mutation	SNP	C	T	T			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr19:4668644C>T	uc002may.2	-	2	257	c.188G>A	c.(187-189)TGT>TAT	p.C63Y		NM_019107	NP_061980	Q969H8	CS010_HUMAN	hypothetical protein LOC56005 precursor	63						ER-Golgi intermediate compartment|extracellular region					0		Hepatocellular(1079;0.137)		UCEC - Uterine corpus endometrioid carcinoma (162;6.64e-05)|BRCA - Breast invasive adenocarcinoma(158;0.015)		AGTGAACATACACGTATATTT	0.313												
21 | ZNF317	57693	broad.mit.edu	37	19	9267420	9267420	+	Missense_Mutation	SNP	C	T	T			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr19:9267420C>T	uc002mku.2	+	3	433	c.158C>T	c.(157-159)TCC>TTC	p.S53F	ZNF317_uc010xkm.1_Silent_p.F94F|ZNF317_uc002mkv.2_5'UTR|ZNF317_uc002mkw.2_Missense_Mutation_p.S53F|ZNF317_uc002mkx.2_5'UTR|ZNF317_uc002mky.2_5'UTR	NM_020933	NP_065984	Q96PQ6	ZN317_HUMAN	zinc finger protein 317	53					regulation of transcription, DNA-dependent|transcription, DNA-dependent	nucleus	DNA binding|zinc ion binding				0						AGTGTTGGTTCCCAGGTGCAC	0.527												
22 | MAN2B1	4125	broad.mit.edu	37	19	12763065	12763065	+	Missense_Mutation	SNP	C	G	G			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr19:12763065C>G	uc002mub.2	-	16	2024	c.1948G>C	c.(1948-1950)GAC>CAC	p.D650H	MAN2B1_uc010dyv.1_Missense_Mutation_p.D649H	NM_000528	NP_000519	O00754	MA2B1_HUMAN	mannosidase, alpha, class 2B, member 1	650					protein deglycosylation	lysosome	alpha-mannosidase activity|zinc ion binding			ovary(4)|central_nervous_system(2)	6						CTTTCGTTGTCACCTATACTG	0.597												
23 | TMEM147	10430	broad.mit.edu	37	19	36037641	36037641	+	Missense_Mutation	SNP	C	T	T			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr19:36037641C>T	uc002oaj.1	+	4	372	c.275C>T	c.(274-276)GCC>GTC	p.A92V	uc010eec.1_5'Flank|uc002oag.2_5'Flank|TMEM147_uc002oai.1_Missense_Mutation_p.A43V|TMEM147_uc002oak.1_Missense_Mutation_p.P2S	NM_032635	NP_116024	Q9BVK8	TM147_HUMAN	transmembrane protein 147	92						endoplasmic reticulum membrane|integral to membrane	protein binding				0	all_lung(56;1.05e-07)|Lung NSC(56;1.63e-07)|Esophageal squamous(110;0.162)		LUSC - Lung squamous cell carcinoma(66;0.0724)			TCCCGGAATGCCGGCAAGGGA	0.572												
24 | EXOSC5	56915	broad.mit.edu	37	19	41895788	41895788	+	Missense_Mutation	SNP	G	A	A			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr19:41895788G>A	uc002oqo.2	-	4	430	c.407C>T	c.(406-408)GCC>GTC	p.A136V	CYP2F1_uc010xvw.1_Intron|BCKDHA_uc002oqm.3_Intron	NM_020158	NP_064543	Q9NQT4	EXOS5_HUMAN	exosome component Rrp46	136					DNA deamination|exonucleolytic nuclear-transcribed mRNA catabolic process involved in deadenylation-dependent decay|rRNA processing	cytosol|exosome (RNase complex)|nucleolus|transcriptionally active chromatin	3'-5'-exoribonuclease activity|protein binding|RNA binding				0						CATGCAGGCGGCATTCAGACA	0.448												
25 | NLRP5	126206	broad.mit.edu	37	19	56539217	56539217	+	Missense_Mutation	SNP	T	A	A			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr19:56539217T>A	uc002qmj.2	+	7	1618	c.1618T>A	c.(1618-1620)TGG>AGG	p.W540R	NLRP5_uc002qmi.2_Missense_Mutation_p.W521R	NM_153447	NP_703148	P59047	NALP5_HUMAN	NACHT, LRR and PYD containing protein 5	540	NACHT.					mitochondrion|nucleolus	ATP binding			ovary(3)|skin(2)|kidney(1)|central_nervous_system(1)	7		Colorectal(82;3.46e-05)|Ovarian(87;0.0481)|Renal(1328;0.157)		GBM - Glioblastoma multiforme(193;0.0326)		GGAGGGAGTGTGGAATAGGAA	0.552												
26 | FIGN	55137	broad.mit.edu	37	2	164467616	164467616	+	Silent	SNP	G	A	A			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr2:164467616G>A	uc002uck.1	-	3	1037	c.726C>T	c.(724-726)CTC>CTT	p.L242L		NM_018086	NP_060556	Q5HY92	FIGN_HUMAN	fidgetin	242	Pro-rich.					nuclear matrix	ATP binding|nucleoside-triphosphatase activity			large_intestine(2)|ovary(1)|skin(1)	4						TGTAACTGGAGAGGTTAGAAG	0.612												
27 | SNRPB	6628	broad.mit.edu	37	20	2443779	2443779	+	Missense_Mutation	SNP	C	T	T			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr20:2443779C>T	uc002wfz.1	-	5	678	c.515G>A	c.(514-516)CGT>CAT	p.R172H	SNRPB_uc002wga.1_Missense_Mutation_p.R172H|SNRPB_uc010zpv.1_Missense_Mutation_p.R93H|SNRPB_uc002wgb.2_Missense_Mutation_p.R172H|SNORD119_uc010gam.1_5'Flank	NM_198216	NP_937859	P14678	RSMB_HUMAN	small nuclear ribonucleoprotein polypeptide B/B'	172				RG -> L (in Ref. 4).	histone mRNA metabolic process|ncRNA metabolic process|spliceosomal snRNP assembly|termination of RNA polymerase II transcription	catalytic step 2 spliceosome|cytosol|nucleoplasm|U12-type spliceosomal complex|U7 snRNP	protein binding|protein binding|RNA binding			ovary(1)	1						AGGACCCCCACGGCCAGGTGG	0.597												
28 | SENP5	205564	broad.mit.edu	37	3	196613120	196613120	+	Nonsense_Mutation	SNP	G	A	A			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr3:196613120G>A	uc003fwz.3	+	2	1317	c.1068G>A	c.(1066-1068)TGG>TGA	p.W356*	SENP5_uc011bty.1_Nonsense_Mutation_p.W356*	NM_152699	NP_689912	Q96HI0	SENP5_HUMAN	SUMO1/sentrin specific peptidase 5	356					cell cycle|cell division|proteolysis	nucleolus	cysteine-type peptidase activity			breast(2)|lung(1)	3	all_cancers(143;1.8e-08)|Ovarian(172;0.0634)|Breast(254;0.135)		Epithelial(36;3.14e-24)|all cancers(36;2.1e-22)|OV - Ovarian serous cystadenocarcinoma(49;1.03e-18)|LUSC - Lung squamous cell carcinoma(58;1.51e-06)|Lung(62;1.95e-06)	GBM - Glioblastoma multiforme(46;0.004)		CAAACGCCTGGGACCAGTCAT	0.468												
29 | OR2J2	26707	broad.mit.edu	37	6	29142195	29142195	+	Silent	SNP	C	G	G			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr6:29142195C>G	uc011dlm.1	+	1	885	c.783C>G	c.(781-783)CTC>CTG	p.L261L		NM_030905	NP_112167	O76002	OR2J2_HUMAN	olfactory receptor, family 2, subfamily J,	261	Extracellular (Potential).				sensory perception of smell	integral to membrane|plasma membrane	olfactory receptor activity				0						GCATGTATCTCCAGCCACCAT	0.433												
30 | MUC17	140453	broad.mit.edu	37	7	100677921	100677921	+	Missense_Mutation	SNP	C	T	T			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr7:100677921C>T	uc003uxp.1	+	3	3277	c.3224C>T	c.(3223-3225)ACT>ATT	p.T1075I	MUC17_uc010lho.1_RNA	NM_001040105	NP_001035194	Q685J3	MUC17_HUMAN	mucin 17 precursor	1075	Extracellular (Potential).|Ser-rich.|59 X approximate tandem repeats.|16.					extracellular region|integral to membrane|plasma membrane	extracellular matrix constituent, lubricant activity			ovary(14)|skin(8)|breast(3)|lung(2)	27	Lung NSC(181;0.136)|all_lung(186;0.182)					CCTGTGACCACTTATTCTCAA	0.488												
31 | EPHA1	2041	broad.mit.edu	37	7	143098437	143098437	+	Nonsense_Mutation	SNP	G	A	A			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr7:143098437G>A	uc003wcz.2	-	3	499	c.412C>T	c.(412-414)CGA>TGA	p.R138*		NM_005232	NP_005223	P21709	EPHA1_HUMAN	ephrin receptor EphA1 precursor	138	Extracellular (Potential).					integral to plasma membrane	ATP binding|ephrin receptor activity			ovary(3)|lung(1)|breast(1)	5	Melanoma(164;0.205)	Myeloproliferative disorder(862;0.0255)				AAGGGCCGTCGGAGCTGAATG	0.592												
32 | ATP6V1C1	528	broad.mit.edu	37	8	104075258	104075258	+	Missense_Mutation	SNP	C	G	G			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr8:104075258C>G	uc003ykz.3	+	9	962	c.717C>G	c.(715-717)CAC>CAG	p.H239Q	ATP6V1C1_uc010mbz.2_Missense_Mutation_p.H164Q|ATP6V1C1_uc003yla.2_Missense_Mutation_p.H239Q|ATP6V1C1_uc011lhl.1_Missense_Mutation_p.H164Q	NM_001695	NP_001686	P21283	VATC1_HUMAN	ATPase, H+ transporting, lysosomal V1 subunit	239					ATP hydrolysis coupled proton transport|cellular iron ion homeostasis|insulin receptor signaling pathway|transferrin transport	cytosol|plasma membrane|proton-transporting V-type ATPase, V1 domain	protein binding|proton-transporting ATPase activity, rotational mechanism				0	Lung NSC(17;0.000427)|all_lung(17;0.000533)		OV - Ovarian serous cystadenocarcinoma(57;3.57e-05)|STAD - Stomach adenocarcinoma(118;0.133)			ACTTCAGACACAAAGCCAGAG	0.328												
33 | LRRC6	23639	broad.mit.edu	37	8	133645122	133645122	+	Missense_Mutation	SNP	C	T	T			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr8:133645122C>T	uc003ytk.2	-	5	591	c.517G>A	c.(517-519)GAA>AAA	p.E173K	LRRC6_uc003ytl.2_RNA	NM_012472	NP_036604	Q86X45	LRRC6_HUMAN	leucine rich repeat containing 6	173						cytoplasm				ovary(1)|kidney(1)	2	Ovarian(258;0.00352)|Esophageal squamous(12;0.00507)|all_neural(3;0.0052)|Medulloblastoma(3;0.0922)|Acute lymphoblastic leukemia(118;0.155)		BRCA - Breast invasive adenocarcinoma(115;0.000311)			TGATCTTTTTCCTGCTCTCTG	0.398												
34 | CDKN2B	1030	broad.mit.edu	37	9	22006044	22006044	+	Missense_Mutation	SNP	G	T	T			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr9:22006044G>T	uc003zpo.2	-	2	719	c.359C>A	c.(358-360)GCC>GAC	p.A120D	MTAP_uc003zpi.1_Intron|CDKN2BAS_uc010miw.1_Intron|CDKN2BAS_uc010mix.1_Intron|CDKN2BAS_uc003zpm.2_Intron|CDKN2B_uc003zpn.2_3'UTR	NM_004936	NP_004927	P42772	CDN2B_HUMAN	cyclin-dependent kinase inhibitor 2B isoform 1	120	ANK 4.				cell cycle arrest|cellular response to nutrient|G1 phase of mitotic cell cycle|G2/M transition of mitotic cell cycle|megakaryocyte differentiation|mitotic cell cycle G1/S transition checkpoint|negative regulation of epithelial cell proliferation|positive regulation of transforming growth factor beta receptor signaling pathway|regulation of cyclin-dependent protein kinase activity	cytosol|nucleus	cyclin-dependent protein kinase inhibitor activity|protein kinase binding			lung(1)	1		all_cancers(5;0)|Acute lymphoblastic leukemia(3;0)|all_hematologic(3;0)|all_epithelial(2;1.31e-280)|Lung NSC(2;2.28e-131)|all_lung(2;2.11e-123)|Glioma(2;5.66e-57)|all_neural(2;3.05e-50)|Renal(3;1.07e-46)|Esophageal squamous(3;3.83e-46)|Melanoma(2;8.01e-33)|Breast(3;1.14e-11)|Ovarian(3;0.000128)|Hepatocellular(5;0.00369)|Colorectal(97;0.172)		all cancers(2;0)|GBM - Glioblastoma multiforme(3;0)|Lung(2;3.29e-71)|Epithelial(2;9.08e-60)|LUSC - Lung squamous cell carcinoma(2;5.8e-46)|LUAD - Lung adenocarcinoma(2;1.43e-25)|BRCA - Breast invasive adenocarcinoma(2;5.37e-09)|STAD - Stomach adenocarcinoma(4;4.63e-07)|Kidney(2;6.92e-07)|KIRC - Kidney renal clear cell carcinoma(2;8.63e-07)|OV - Ovarian serous cystadenocarcinoma(39;0.014)|COAD - Colon adenocarcinoma(8;0.143)		CCGCTCCTCGGCCAAGTCCAC	0.701									Familial_Malignant_Melanoma_and_Tumors_of_the_Nervous_System			
35 | CDKN2B	1030	broad.mit.edu	37	9	22006068	22006068	+	Missense_Mutation	SNP	C	A	A			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chr9:22006068C>A	uc003zpo.2	-	2	695	c.335G>T	c.(334-336)TGG>TTG	p.W112L	MTAP_uc003zpi.1_Intron|CDKN2BAS_uc010miw.1_Intron|CDKN2BAS_uc010mix.1_Intron|CDKN2BAS_uc003zpm.2_Intron|CDKN2B_uc003zpn.2_3'UTR	NM_004936	NP_004927	P42772	CDN2B_HUMAN	cyclin-dependent kinase inhibitor 2B isoform 1	112	ANK 4.				cell cycle arrest|cellular response to nutrient|G1 phase of mitotic cell cycle|G2/M transition of mitotic cell cycle|megakaryocyte differentiation|mitotic cell cycle G1/S transition checkpoint|negative regulation of epithelial cell proliferation|positive regulation of transforming growth factor beta receptor signaling pathway|regulation of cyclin-dependent protein kinase activity	cytosol|nucleus	cyclin-dependent protein kinase inhibitor activity|protein kinase binding			lung(1)	1		all_cancers(5;0)|Acute lymphoblastic leukemia(3;0)|all_hematologic(3;0)|all_epithelial(2;1.31e-280)|Lung NSC(2;2.28e-131)|all_lung(2;2.11e-123)|Glioma(2;5.66e-57)|all_neural(2;3.05e-50)|Renal(3;1.07e-46)|Esophageal squamous(3;3.83e-46)|Melanoma(2;8.01e-33)|Breast(3;1.14e-11)|Ovarian(3;0.000128)|Hepatocellular(5;0.00369)|Colorectal(97;0.172)		all cancers(2;0)|GBM - Glioblastoma multiforme(3;0)|Lung(2;3.29e-71)|Epithelial(2;9.08e-60)|LUSC - Lung squamous cell carcinoma(2;5.8e-46)|LUAD - Lung adenocarcinoma(2;1.43e-25)|BRCA - Breast invasive adenocarcinoma(2;5.37e-09)|STAD - Stomach adenocarcinoma(4;4.63e-07)|Kidney(2;6.92e-07)|KIRC - Kidney renal clear cell carcinoma(2;8.63e-07)|OV - Ovarian serous cystadenocarcinoma(39;0.014)|COAD - Colon adenocarcinoma(8;0.143)		CAGACGACCCCAGGCATCGCG	0.726									Familial_Malignant_Melanoma_and_Tumors_of_the_Nervous_System			
36 | OTC	5009	broad.mit.edu	37	X	38260629	38260629	+	Missense_Mutation	SNP	T	C	C			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chrX:38260629T>C	uc004def.3	+	5	702	c.488T>C	c.(487-489)CTG>CCG	p.L163P		NM_000531	NP_000522	P00480	OTC_HUMAN	ornithine carbamoyltransferase precursor	163					arginine biosynthetic process|urea cycle	mitochondrial matrix|ornithine carbamoyltransferase complex	ornithine carbamoyltransferase activity			ovary(1)|breast(1)	2					L-Citrulline(DB00155)|L-Ornithine(DB00129)	ATCAATGGGCTGTCAGATTTG	0.408												
37 | HUWE1	10075	broad.mit.edu	37	X	53569470	53569470	+	Missense_Mutation	SNP	G	A	A			TCGA-28-5218-01A-01D-1486-08	TCGA-28-5218-10A-01D-1486-08									Somatic	Phase_I	Capture				Illumina GAIIx	68008a98-3889-4dd2-bcf9-f1f6cbca6355	727e8e46-718d-4e44-96a1-ed3544500a07	g.chrX:53569470G>A	uc004dsp.2	-	74	11812	c.11410C>T	c.(11410-11412)CGG>TGG	p.R3804W	HUWE1_uc004dsn.2_Missense_Mutation_p.R2612W|HUWE1_uc004dsq.1_Missense_Mutation_p.R104W	NM_031407	NP_113584	Q7Z6Z7	HUWE1_HUMAN	HECT, UBA and WWE domain containing 1	3804					base-excision repair|cell differentiation|histone ubiquitination|protein monoubiquitination|protein polyubiquitination|protein ubiquitination involved in ubiquitin-dependent protein catabolic process	cytoplasm|nucleus	DNA binding|protein binding|ubiquitin-protein ligase activity			ovary(8)|large_intestine(4)|breast(4)|kidney(1)	17						TCCTCCCTCCGGACAGACGCC	0.502												
38 | 


--------------------------------------------------------------------------------
/HW6/data/TCGA-32-4209.maf.txt:
--------------------------------------------------------------------------------
 1 | Hugo_Symbol	Entrez_Gene_Id	Center	NCBI_Build	Chromosome	Start_position	End_position	Strand	Variant_Classification	Variant_Type	Reference_Allele	Tumor_Seq_Allele1	Tumor_Seq_Allele2	dbSNP_RS	dbSNP_Val_Status	Tumor_Sample_Barcode	Matched_Norm_Sample_Barcode	Match_Norm_Seq_Allele1	Match_Norm_Seq_Allele2	Tumor_Validation_Allele1	Tumor_Validation_Allele2	Match_Norm_Validation_Allele1	Match_Norm_Validation_Allele2	Verification_Status	Validation_Status	Mutation_Status	Sequencing_Phase	Sequence_Source	Validation_Method	Score	BAM_file	Sequencer	Tumor_Sample_UUID	Matched_Norm_Sample_UUID	Genome_Change	Annotation_Transcript	Transcript_Strand	Transcript_Exon	Transcript_Position	cDNA_Change	Codon_Change	Protein_Change	Other_Transcripts	Refseq_mRNA_Id	Refseq_prot_Id	SwissProt_acc_Id	SwissProt_entry_Id	Description	UniProt_AApos	UniProt_Region	UniProt_Site	UniProt_Natural_Variations	UniProt_Experimental_Info	GO_Biological_Process	GO_Cellular_Component	GO_Molecular_Function	COSMIC_overlapping_mutations	COSMIC_fusion_genes	COSMIC_tissue_types_affected	COSMIC_total_alterations_in_gene	Tumorscape_Amplification_Peaks	Tumorscape_Deletion_Peaks	TCGAscape_Amplification_Peaks	TCGAscape_Deletion_Peaks	DrugBank	ref_context	gc_content	CCLE_ONCOMAP_overlapping_mutations	CCLE_ONCOMAP_total_mutations_in_gene	CGC_Mutation_Type	CGC_Translocation_Partner	CGC_Tumor_Types_Somatic	CGC_Tumor_Types_Germline	CGC_Other_Diseases	DNARepairGenes_Role	FamilialCancerDatabase_Syndromes	MUTSIG_Published_Results	OREGANNO_ID	OREGANNO_Values
 2 | DNAJC11	55735	broad.mit.edu	37	1	6727822	6727822	+	Missense_Mutation	SNP	G	A	A			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr1:6727822G>A	uc001aof.2	-	4	431	c.325C>T	c.(325-327)CGG>TGG	p.R109W	DNAJC11_uc010nzt.1_Missense_Mutation_p.R71W|DNAJC11_uc001aog.2_Missense_Mutation_p.R109W|DNAJC11_uc010nzu.1_Missense_Mutation_p.R19W	NM_018198	NP_060668	Q9NVH1	DJC11_HUMAN	DnaJ (Hsp40) homolog, subfamily C, member 11	109					protein folding		heat shock protein binding|unfolded protein binding			ovary(1)|skin(1)	2	Ovarian(185;0.0265)|all_lung(157;0.154)	all_cancers(23;1.97e-27)|all_epithelial(116;1.76e-17)|all_lung(118;2.27e-05)|Lung NSC(185;9.97e-05)|Renal(390;0.00188)|Breast(487;0.00289)|Colorectal(325;0.00342)|Hepatocellular(190;0.0218)|Myeloproliferative disorder(586;0.0393)|Ovarian(437;0.156)		Colorectal(212;2.34e-07)|COAD - Colon adenocarcinoma(227;2.05e-05)|Kidney(185;7.67e-05)|BRCA - Breast invasive adenocarcinoma(304;0.000639)|KIRC - Kidney renal clear cell carcinoma(229;0.00128)|STAD - Stomach adenocarcinoma(132;0.00179)|READ - Rectum adenocarcinoma(331;0.0649)		CTCTGCAGCCGCTCAAACTCC	0.522												
 3 | MST1P9	11223	broad.mit.edu	37	1	17085479	17085479	+	Silent	SNP	T	C	C			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr1:17085479T>C	uc010ock.1	-	10	1212	c.1212A>G	c.(1210-1212)AAA>AAG	p.K404K	CROCC_uc009voy.1_Intron|MST1P9_uc001azp.3_5'UTR	NR_002729				SubName: Full=Hepatocyte growth factor-like protein homolog;												0						GTCTCAACCATTTCCAGGCTC	0.617												
 4 | LPAR3	23566	broad.mit.edu	37	1	85331664	85331665	+	Frame_Shift_Ins	INS	-	A	A	rs76299065		TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr1:85331664_85331665insA	uc001dkl.2	-	1	178_179	c.139_140insT	c.(139-141)TCTfs	p.S47fs	LPAR3_uc009wcj.1_Frame_Shift_Ins_p.S47fs	NM_012152	NP_036284	Q9UBY5	LPAR3_HUMAN	lysophosphatidic acid receptor 3	47	Helical; Name=1; (Potential).				G-protein signaling, coupled to cyclic nucleotide second messenger|synaptic transmission	integral to plasma membrane|intracellular membrane-bounded organelle				lung(3)|ovary(2)	5						CAGAGAATTAGAAAAAAAAATA	0.401												
 5 | NBPF10	100132406	broad.mit.edu	37	1	145324371	145324371	+	Missense_Mutation	SNP	T	C	C			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr1:145324371T>C	uc001end.3	+	30	3826	c.3791T>C	c.(3790-3792)GTA>GCA	p.V1264A	NBPF10_uc009wir.2_Intron|NBPF9_uc010oye.1_Intron|NBPF10_uc001emp.3_Intron|NBPF10_uc010oyi.1_Intron|NBPF10_uc010oyk.1_Intron|NBPF10_uc010oyl.1_Intron|NBPF10_uc001enc.2_Intron|NBPF10_uc010oym.1_Intron|NBPF10_uc010oyn.1_Intron|NBPF10_uc010oyo.1_Intron|NBPF10_uc010oyp.1_RNA	NM_001039703	NP_001034792	A6NDV3	A6NDV3_HUMAN	hypothetical protein LOC100132406	1189											0	all_hematologic(923;0.032)			Colorectal(1306;1.36e-07)|KIRC - Kidney renal clear cell carcinoma(1967;0.00258)		CTGCTGGAGGTAGTAGCGCCT	0.498												
 6 | LOC645166	645166	broad.mit.edu	37	1	148933289	148933289	+	Splice_Site	SNP	A	G	G	rs9729175	by1000genomes	TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr1:148933289A>G	uc010pbc.1	+	3		c.236_splice	c.e3-2		LOC645166_uc010pbd.1_Intron|LOC645166_uc009wkw.1_Splice_Site	NR_027355				Homo sapiens cDNA, FLJ18771.												0						TGCTGCCCGCAGGATATTGTG	0.562												
 7 | TDRKH	11022	broad.mit.edu	37	1	151755433	151755433	+	Silent	SNP	C	T	T			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr1:151755433C>T	uc009wnb.1	-	2	248	c.66G>A	c.(64-66)GGG>GGA	p.G22G	TDRKH_uc001eyy.2_5'UTR|TDRKH_uc001ezb.3_Silent_p.G22G|TDRKH_uc001ezc.3_Silent_p.G22G|TDRKH_uc001eza.3_Silent_p.G22G|TDRKH_uc001ezd.3_Silent_p.G22G|TDRKH_uc010pdn.1_5'UTR	NM_006862	NP_006853	Q9Y2W6	TDRKH_HUMAN	tudor and KH domain containing isoform a	22							RNA binding	p.G22V(1)		ovary(1)|pancreas(1)	2	Hepatocellular(266;0.0877)|all_hematologic(923;0.127)|Melanoma(130;0.14)		LUSC - Lung squamous cell carcinoma(543;0.181)			TGGCTGGGATCCCAAGGCCCA	0.463												
 8 | CRTC2	200186	broad.mit.edu	37	1	153921628	153921628	+	Missense_Mutation	SNP	G	A	A			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr1:153921628G>A	uc010ped.1	-	12	1707	c.1637C>T	c.(1636-1638)TCT>TTT	p.S546F	DENND4B_uc001fdd.1_5'Flank|CRTC2_uc001fde.3_RNA|CRTC2_uc001fdf.3_Missense_Mutation_p.S82F	NM_181715	NP_859066	Q53ET0	CRTC2_HUMAN	CREB regulated transcription coactivator 2	546					interspecies interaction between organisms|regulation of transcription, DNA-dependent|transcription, DNA-dependent	cytoplasm|nucleus	protein binding			ovary(2)	2	all_lung(78;3.05e-32)|Lung NSC(65;3.74e-30)|Hepatocellular(266;0.0877)|Melanoma(130;0.199)		LUSC - Lung squamous cell carcinoma(543;0.151)			CCGGTGGTAAGACTGTTGCCC	0.597												
 9 | OR10J3	441911	broad.mit.edu	37	1	159283999	159283999	+	Missense_Mutation	SNP	C	T	T			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr1:159283999C>T	uc010piu.1	-	1	451	c.451G>A	c.(451-453)GGG>AGG	p.G151R		NM_001004467	NP_001004467	Q5JRS4	O10J3_HUMAN	olfactory receptor, family 10, subfamily J,	151	Helical; Name=4; (Potential).				sensory perception of smell	integral to membrane|plasma membrane	olfactory receptor activity			ovary(2)	2	all_hematologic(112;0.0429)					AGGCCAATCCCCAGTGATCCA	0.507												
10 | POU2F1	5451	broad.mit.edu	37	1	167358969	167358969	+	Missense_Mutation	SNP	C	G	G			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr1:167358969C>G	uc001gec.2	+	10	1051	c.889C>G	c.(889-891)CAA>GAA	p.Q297E	POU2F1_uc010plg.1_RNA|POU2F1_uc001ged.2_Missense_Mutation_p.Q295E|POU2F1_uc001gee.2_Missense_Mutation_p.Q297E|POU2F1_uc010plh.1_Missense_Mutation_p.Q234E|POU2F1_uc001gef.2_Missense_Mutation_p.Q309E|POU2F1_uc001geg.2_Missense_Mutation_p.Q195E	NM_002697	NP_002688	P14859	PO2F1_HUMAN	POU class 2 homeobox 1	297	POU-specific.				negative regulation of transcription, DNA-dependent|transcription from RNA polymerase III promoter	nucleoplasm	protein binding|sequence-specific DNA binding|sequence-specific DNA binding transcription factor activity			central_nervous_system(2)|skin(2)|breast(1)	5						GACCTTCAAACAAAGACGAAT	0.438												
11 | C1orf26	54823	broad.mit.edu	37	1	185143825	185143825	+	Missense_Mutation	SNP	G	C	C	rs146489629	byFrequency	TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr1:185143825G>C	uc001grg.3	+	5	660	c.546G>C	c.(544-546)AAG>AAC	p.K182N	C1orf26_uc001grh.3_Missense_Mutation_p.K182N	NM_001105518	NP_001098988	Q5T5J6	SWT1_HUMAN	hypothetical protein LOC54823	182											0						AGAGAGAGAAGATGAAAGAAC	0.353												
12 | CFH	3075	broad.mit.edu	37	1	196694295	196694295	+	Missense_Mutation	SNP	G	A	A			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr1:196694295G>A	uc001gtj.3	+	12	1981	c.1741G>A	c.(1741-1743)GAT>AAT	p.D581N		NM_000186	NP_000177	P08603	CFAH_HUMAN	complement factor H isoform a precursor	581	Sushi 10.				complement activation, alternative pathway	extracellular space				skin(4)|ovary(1)|breast(1)	6						CTTAGTTCCTGATCGCAAGAA	0.343												
13 | TLL2	7093	broad.mit.edu	37	10	98155658	98155658	+	Missense_Mutation	SNP	C	T	T			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr10:98155658C>T	uc001kml.1	-	12	1730	c.1504G>A	c.(1504-1506)GTG>ATG	p.V502M	TLL2_uc009xvf.1_Missense_Mutation_p.V480M	NM_012465	NP_036597	Q9Y6L7	TLL2_HUMAN	tolloid-like 2 precursor	502	CUB 2.				cell differentiation|multicellular organismal development|proteolysis	extracellular region	calcium ion binding|metalloendopeptidase activity|zinc ion binding			ovary(1)|pancreas(1)|skin(1)	3		Colorectal(252;0.0846)		Epithelial(162;1.51e-07)|all cancers(201;7.59e-06)		GTAAGTCCCACGTGAAACCCC	0.498											OREG0020398	type=REGULATORY REGION|TFbs=CTCF|Dataset=CTCF ChIP-chip sites (Ren lab)|EvidenceSubtype=ChIP-on-chip (ChIP-chip)
14 | CHUK	1147	broad.mit.edu	37	10	101960490	101960490	+	Silent	SNP	A	G	G			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr10:101960490A>G	uc001kqp.2	-	15	1672	c.1617T>C	c.(1615-1617)GCT>GCC	p.A539A		NM_001278	NP_001269	O15111	IKKA_HUMAN	conserved helix-loop-helix ubiquitous kinase	539					I-kappaB phosphorylation|innate immune response|MyD88-dependent toll-like receptor signaling pathway|MyD88-independent toll-like receptor signaling pathway|nerve growth factor receptor signaling pathway|phosphatidylinositol-mediated signaling|positive regulation of I-kappaB kinase/NF-kappaB cascade|positive regulation of NF-kappaB transcription factor activity|T cell receptor signaling pathway|Toll signaling pathway|toll-like receptor 1 signaling pathway|toll-like receptor 2 signaling pathway|toll-like receptor 3 signaling pathway|toll-like receptor 4 signaling pathway	CD40 receptor complex|cytosol|internal side of plasma membrane|nucleus	ATP binding|identical protein binding|IkappaB kinase activity			ovary(2)|central_nervous_system(2)|large_intestine(1)|lung(1)|breast(1)	7		Colorectal(252;0.117)		Epithelial(162;2.05e-10)|all cancers(201;1.91e-08)		CCATGATTTCAGCATGCAAAG	0.413												
15 | MYO7A	4647	broad.mit.edu	37	11	76901767	76901767	+	Missense_Mutation	SNP	T	C	C			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr11:76901767T>C	uc001oyb.2	+	30	4048	c.3776T>C	c.(3775-3777)ATG>ACG	p.M1259T	MYO7A_uc010rsm.1_Missense_Mutation_p.M1248T|MYO7A_uc001oyc.2_Missense_Mutation_p.M1259T|MYO7A_uc009yus.1_RNA|MYO7A_uc009yut.1_Missense_Mutation_p.M470T	NM_000260	NP_000251	Q13402	MYO7A_HUMAN	myosin VIIA isoform 1	1259	FERM 1.				actin filament-based movement|equilibrioception|lysosome organization|sensory perception of sound|visual perception	cytosol|lysosomal membrane|myosin complex|photoreceptor inner segment|photoreceptor outer segment|synapse	actin binding|ATP binding|calmodulin binding|microfilament motor activity			ovary(3)|breast(1)	4						AAGCCAATCATGTTGCCCGTG	0.597												
16 | C12orf35	55196	broad.mit.edu	37	12	32135884	32135884	+	Missense_Mutation	SNP	C	G	G			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr12:32135884C>G	uc001rks.2	+	4	2409	c.1995C>G	c.(1993-1995)GAC>GAG	p.D665E		NM_018169	NP_060639	Q9HCM1	CL035_HUMAN	hypothetical protein LOC55196	665										ovary(1)|skin(1)	2	all_cancers(9;3.36e-11)|all_epithelial(9;2.56e-11)|all_lung(12;5.67e-10)|Acute lymphoblastic leukemia(23;0.0122)|Lung SC(12;0.0336)|all_hematologic(23;0.0429)|Esophageal squamous(101;0.204)		OV - Ovarian serous cystadenocarcinoma(6;0.0114)			CTAAAAGTGACAGTAGCTGTT	0.423												
17 | ABCD2	225	broad.mit.edu	37	12	40013182	40013182	+	Missense_Mutation	SNP	C	G	G			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr12:40013182C>G	uc001rmb.2	-	1	662	c.236G>C	c.(235-237)GGA>GCA	p.G79A		NM_005164	NP_005155	Q9UBJ2	ABCD2_HUMAN	ATP-binding cassette, sub-family D, member 2	79	Interaction with PEX19.				fatty acid metabolic process|transport	ATP-binding cassette (ABC) transporter complex|integral to plasma membrane|peroxisomal membrane	ATP binding|ATPase activity|protein binding			ovary(2)|upper_aerodigestive_tract(1)|pancreas(1)|central_nervous_system(1)|skin(1)	6						TGCATTCACTCCAGGCGAAGG	0.463												
18 | OR6C2	341416	broad.mit.edu	37	12	55846834	55846834	+	Silent	SNP	C	T	T			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr12:55846834C>T	uc001sgz.1	+	1	837	c.837C>T	c.(835-837)GTC>GTT	p.V279V		NM_054105	NP_473446	Q9NZP2	OR6C2_HUMAN	olfactory receptor, family 6, subfamily C,	279	Helical; Name=7; (Potential).				sensory perception of smell	integral to membrane|plasma membrane	olfactory receptor activity			skin(2)	2						CTACTTCTGTCGCACCCTTGT	0.408												
19 | LEMD3	23592	broad.mit.edu	37	12	65637180	65637180	+	Missense_Mutation	SNP	A	G	G			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr12:65637180A>G	uc001ssl.1	+	10	2324	c.2318A>G	c.(2317-2319)GAT>GGT	p.D773G	LEMD3_uc009zqo.1_Missense_Mutation_p.D772G	NM_014319	NP_055134	Q9Y2U8	MAN1_HUMAN	LEM domain containing 3	773	Interaction with SMAD1, SMAD2, SMAD3 and SMAD5.				negative regulation of activin receptor signaling pathway|negative regulation of BMP signaling pathway|negative regulation of transforming growth factor beta receptor signaling pathway	integral to nuclear inner membrane|membrane fraction	DNA binding|nucleotide binding|protein binding			central_nervous_system(3)|ovary(1)	4			LUAD - Lung adenocarcinoma(6;0.0234)|LUSC - Lung squamous cell carcinoma(43;0.0975)	GBM - Glioblastoma multiforme(28;0.0104)		TTTCATTTAGATAGAAGAAAT	0.279												
20 | IKBIP	121457	broad.mit.edu	37	12	99007867	99007867	+	Silent	SNP	T	C	C			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr12:99007867T>C	uc001tfv.2	-	3	659	c.549A>G	c.(547-549)TCA>TCG	p.S183S	IKBIP_uc001tfw.2_3'UTR	NM_201612	NP_963906	Q70UQ0	IKIP_HUMAN	IKK interacting protein isoform 2	183					induction of apoptosis|response to X-ray	endoplasmic reticulum membrane|integral to membrane	protein binding				0						TTACTAAACCTGAAATCCGTC	0.308												
21 | ACADS	35	broad.mit.edu	37	12	121176677	121176677	+	Missense_Mutation	SNP	C	T	T	rs140853839	byFrequency	TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr12:121176677C>T	uc001tza.3	+	8	1106	c.988C>T	c.(988-990)CGC>TGC	p.R330C	ACADS_uc010szl.1_Missense_Mutation_p.R326C|ACADS_uc001tzb.3_Missense_Mutation_p.R211C	NM_000017	NP_000008	P16219	ACADS_HUMAN	short-chain acyl-CoA dehydrogenase precursor	330						mitochondrial matrix	butyryl-CoA dehydrogenase activity			central_nervous_system(2)	2	all_neural(191;0.0684)|Medulloblastoma(191;0.0922)	Lung NSC(355;0.163)			NADH(DB00157)	GCTGACCTGGCGCGCTGCCAT	0.637												
22 | MMP14	4323	broad.mit.edu	37	14	23312494	23312494	+	Silent	SNP	C	T	T			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr14:23312494C>T	uc001whc.2	+	5	951	c.717C>T	c.(715-717)CAC>CAT	p.H239H		NM_004995	NP_004986	P50281	MMP14_HUMAN	matrix metalloproteinase 14 preproprotein	239	Extracellular (Potential).	Zinc; catalytic.				extracellular matrix|integral to plasma membrane|melanosome	calcium ion binding|metalloendopeptidase activity|zinc ion binding				0	all_cancers(95;9.47e-05)			GBM - Glioblastoma multiforme(265;0.00551)		TGGCTGTGCACGAGCTGGGCC	0.602												
23 | TMC7	79905	broad.mit.edu	37	16	19073157	19073157	+	Missense_Mutation	SNP	A	T	T			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr16:19073157A>T	uc002dfq.2	+	16	2294	c.2164A>T	c.(2164-2166)AGG>TGG	p.R722W	TMC7_uc010vap.1_Missense_Mutation_p.R612W	NM_024847	NP_079123	Q7Z402	TMC7_HUMAN	transmembrane channel-like 7 isoform a	722	Cytoplasmic (Potential).					integral to membrane				skin(2)|ovary(1)	3						AAGGGACATGAGGAACTAACT	0.418												
24 | ULK2	9706	broad.mit.edu	37	17	19699577	19699577	+	Missense_Mutation	SNP	T	G	G			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr17:19699577T>G	uc002gwm.3	-	19	2337	c.1828A>C	c.(1828-1830)ATC>CTC	p.I610L	ULK2_uc002gwn.2_Missense_Mutation_p.I610L	NM_001142610	NP_001136082	Q8IYT8	ULK2_HUMAN	unc-51-like kinase 2	610					signal transduction		ATP binding|protein binding|protein serine/threonine kinase activity			skin(2)|large_intestine(1)|stomach(1)	4	all_cancers(12;4.97e-05)|all_epithelial(12;0.00362)|Breast(13;0.186)					GTTTTAGGGATTTTGAAAGGA	0.413												
25 | CNTNAP1	8506	broad.mit.edu	37	17	40847561	40847561	+	Silent	SNP	G	A	A			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr17:40847561G>A	uc002iay.2	+	19	3231	c.3015G>A	c.(3013-3015)CCG>CCA	p.P1005P	CNTNAP1_uc010wgs.1_RNA	NM_003632	NP_003623	P78357	CNTP1_HUMAN	contactin associated protein 1 precursor	1005	Extracellular (Potential).				axon guidance|cell adhesion	paranode region of axon	receptor activity|receptor binding|SH3 domain binding|SH3/SH2 adaptor activity			ovary(3)|breast(3)|upper_aerodigestive_tract(1)|lung(1)	8		Breast(137;0.000143)		BRCA - Breast invasive adenocarcinoma(366;0.143)		TCTTTGAGCCGGGCACCTGGA	0.567												
26 | TBCD	6904	broad.mit.edu	37	17	80842049	80842049	+	Nonsense_Mutation	SNP	C	T	T			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr17:80842049C>T	uc002kfz.2	+	15	1634	c.1504C>T	c.(1504-1506)CGA>TGA	p.R502*	TBCD_uc002kfx.1_Nonsense_Mutation_p.R485*|TBCD_uc002kfy.1_Nonsense_Mutation_p.R502*	NM_005993	NP_005984	Q9BTW9	TBCD_HUMAN	beta-tubulin cofactor D	502					'de novo' posttranslational protein folding|adherens junction assembly|negative regulation of cell-substrate adhesion|negative regulation of microtubule polymerization|post-chaperonin tubulin folding pathway|tight junction assembly	adherens junction|cytoplasm|lateral plasma membrane|microtubule|tight junction	beta-tubulin binding|chaperone binding|GTPase activator activity				0	Breast(20;0.000523)|all_neural(118;0.0779)	all_cancers(8;0.0266)|all_epithelial(8;0.0696)	OV - Ovarian serous cystadenocarcinoma(97;0.0868)|BRCA - Breast invasive adenocarcinoma(99;0.18)			GGTGTTTGACCGAGACATAAA	0.443												
27 | ZNF492	57615	broad.mit.edu	37	19	22846757	22846757	+	Nonsense_Mutation	SNP	G	T	T	rs112130958		TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr19:22846757G>T	uc002nqw.3	+	4	530	c.286G>T	c.(286-288)GAA>TAA	p.E96*		NM_020855	NP_065906	Q9P255	ZN492_HUMAN	zinc finger protein 492	96					regulation of transcription, DNA-dependent|transcription, DNA-dependent	nucleus	DNA binding|zinc ion binding				0		all_cancers(12;0.0266)|all_lung(12;0.00187)|Lung NSC(12;0.0019)|all_epithelial(12;0.00203)|Hepatocellular(1079;0.244)				GGTGCACAAAGAATGTTACAA	0.299												
28 | CEACAM5	1048	broad.mit.edu	37	19	42224052	42224052	+	Missense_Mutation	SNP	G	A	A	rs138799075	byFrequency	TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr19:42224052G>A	uc002ork.2	+	7	1817	c.1696G>A	c.(1696-1698)GCA>ACA	p.A566T	CEACAM5_uc002orj.1_Missense_Mutation_p.A565T|CEACAM5_uc002orl.2_Missense_Mutation_p.A566T	NM_004363	NP_004354	P06731	CEAM5_HUMAN	carcinoembryonic antigen-related cell adhesion	566	Ig-like 6.					anchored to membrane|basolateral plasma membrane|integral to plasma membrane				skin(2)	2				OV - Ovarian serous cystadenocarcinoma(3;0.00278)|all cancers(3;0.00625)|Epithelial(262;0.0379)|GBM - Glioblastoma multiforme(1328;0.142)		AAGAAATGACGCAAGAGCCTA	0.522												
29 | KLK11	11012	broad.mit.edu	37	19	51528895	51528895	+	Missense_Mutation	SNP	A	G	G			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr19:51528895A>G	uc002pvd.1	-	2	201	c.89T>C	c.(88-90)CTC>CCC	p.L30P	KLK11_uc002pvb.1_5'UTR|KLK11_uc002pve.1_5'UTR|KLK11_uc002pvf.1_5'UTR|KLK11_uc002pvc.3_5'UTR|KLK11_uc010eom.2_5'UTR	NM_144947	NP_659196	Q9UBX7	KLK11_HUMAN	kallikrein 11 isoform 2 precursor	30					proteolysis	extracellular region	serine-type endopeptidase activity				0		all_neural(266;0.026)		OV - Ovarian serous cystadenocarcinoma(262;0.00327)|GBM - Glioblastoma multiforme(134;0.00878)		CATGGCCTGGAGGGGGGAGGA	0.627												
30 | LILRB2	10288	broad.mit.edu	37	19	54783717	54783717	+	Missense_Mutation	SNP	C	T	T	rs145209585	byFrequency	TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr19:54783717C>T	uc002qfb.2	-	4	550	c.284G>A	c.(283-285)CGA>CAA	p.R95Q	LILRA6_uc002qew.1_Intron|LILRB2_uc010eri.2_Missense_Mutation_p.R95Q|LILRB2_uc010erj.2_RNA|LILRB2_uc002qfc.2_Missense_Mutation_p.R95Q|LILRB2_uc010yet.1_5'UTR|LILRB2_uc010yeu.1_RNA	NM_005874	NP_005865	Q8N423	LIRB2_HUMAN	leukocyte immunoglobulin-like receptor,	95	Extracellular (Potential).|Ig-like C2-type 1.				cell surface receptor linked signaling pathway|cell-cell signaling|cellular defense response|immune response|regulation of immune response	integral to plasma membrane|membrane fraction	receptor activity			skin(1)	1	Ovarian(34;0.19)			GBM - Glioblastoma multiforme(193;0.105)		ACAGCCATATCGCCCTGTGTG	0.557												
31 | HEATR5B	54497	broad.mit.edu	37	2	37295836	37295836	+	Missense_Mutation	SNP	T	C	C			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr2:37295836T>C	uc002rpp.1	-	8	1261	c.1165A>G	c.(1165-1167)ATG>GTG	p.M389V		NM_019024	NP_061897	Q9P2D3	HTR5B_HUMAN	HEAT repeat containing 5B	389							binding			ovary(5)|skin(2)|breast(1)	8		all_hematologic(82;0.21)				ACGGCTTTCATTTGTTTTCCA	0.353												
32 | EIF5B	9669	broad.mit.edu	37	2	99977775	99977777	+	In_Frame_Del	DEL	TGA	-	-			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr2:99977775_99977777delTGA	uc002tab.2	+	4	595_597	c.411_413delTGA	c.(409-414)AGTGAT>AGT	p.D142del		NM_015904	NP_056988	O60841	IF2P_HUMAN	eukaryotic translation initiation factor 5B	142	Poly-Asp.				regulation of translational initiation	cytosol	GTP binding|GTPase activity|protein binding|translation initiation factor activity			ovary(2)|pancreas(1)	3						ACTCTGGGAGTGATGATGATGAT	0.345												
33 | KIF5C	3800	broad.mit.edu	37	2	149793797	149793797	+	Splice_Site	SNP	G	A	A			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr2:149793797G>A	uc010zbu.1	+	4	660	c.292_splice	c.e4-1	p.G98_splice		NM_004522	NP_004513	O60282	KIF5C_HUMAN	kinesin family member 5C						microtubule-based movement|organelle organization	cytoplasm|kinesin complex|microtubule	ATP binding|microtubule motor activity			skin(1)	1				BRCA - Breast invasive adenocarcinoma(221;0.108)		TCGCCCACTAGGGGAAGCTGC	0.512												
34 | SIRPG	55423	broad.mit.edu	37	20	1629729	1629729	+	Missense_Mutation	SNP	C	A	A			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr20:1629729C>A	uc002wfm.1	-	2	464	c.399G>T	c.(397-399)AAG>AAT	p.K133N	SIRPG_uc002wfn.1_Missense_Mutation_p.K133N|SIRPG_uc002wfo.1_Missense_Mutation_p.K133N	NM_018556	NP_061026	Q9P1W8	SIRPG_HUMAN	signal-regulatory protein gamma isoform 1	133	Extracellular (Potential).|Ig-like V-type.				blood coagulation|cell adhesion|cell junction assembly|cell-cell signaling|intracellular signal transduction|leukocyte migration|negative regulation of cell proliferation|positive regulation of cell proliferation|positive regulation of cell-cell adhesion|positive regulation of T cell activation	integral to membrane|intracellular|plasma membrane	protein binding			ovary(1)	1						CTGGTCCAGACTTAAACTCCA	0.493												
35 | SIGLEC1	6614	broad.mit.edu	37	20	3673751	3673751	+	Missense_Mutation	SNP	T	C	C			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr20:3673751T>C	uc002wja.2	-	14	3536	c.3536A>G	c.(3535-3537)TAC>TGC	p.Y1179C	SIGLEC1_uc002wjb.1_5'UTR|SIGLEC1_uc002wiz.3_Missense_Mutation_p.Y1179C	NM_023068	NP_075556	Q9BZZ2	SN_HUMAN	sialoadhesin precursor	1179	Ig-like C2-type 12.|Extracellular (Potential).				cell-cell adhesion|cell-matrix adhesion|endocytosis|inflammatory response	extracellular region|integral to membrane|plasma membrane	sugar binding			pancreas(4)|ovary(2)|skin(2)|breast(1)|central_nervous_system(1)	10						CTCCAGGAGGTAGGTCAGGCG	0.682												
36 | NTSR1	4923	broad.mit.edu	37	20	61340984	61340984	+	Missense_Mutation	SNP	G	A	A			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr20:61340984G>A	uc002ydf.2	+	1	796	c.425G>A	c.(424-426)CGC>CAC	p.R142H		NM_002531	NP_002522	P30989	NTR1_HUMAN	neurotensin receptor 1	142	Extracellular (Potential).					endoplasmic reticulum|Golgi apparatus|integral to plasma membrane	neurotensin receptor activity, G-protein coupled			skin(2)|lung(1)|central_nervous_system(1)	4	Breast(26;3.65e-08)		BRCA - Breast invasive adenocarcinoma(19;3.63e-06)			GCCGGCTGCCGCGGCTACTAC	0.677												
37 | TBX1	6899	broad.mit.edu	37	22	19748718	19748718	+	Missense_Mutation	SNP	G	T	T			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr22:19748718G>T	uc002zqb.2	+	3	454	c.325G>T	c.(325-327)GGT>TGT	p.G109C	TBX1_uc002zqa.1_Missense_Mutation_p.G109C|TBX1_uc002zqc.2_Missense_Mutation_p.G109C	NM_080646	NP_542377	O43435	TBX1_HUMAN	T-box 1 isoform A	109					embryonic viscerocranium morphogenesis|heart development|parathyroid gland development|pharyngeal system development|regulation of transcription from RNA polymerase II promoter|soft palate development|thymus development	nucleus	protein homodimerization activity|sequence-specific DNA binding|sequence-specific DNA binding transcription factor activity			ovary(1)|breast(1)	2	Colorectal(54;0.0993)	all_lung(157;3.05e-06)				GAAGGTGGCCGGTGTGAGCGT	0.592												
38 | LZTR1	8216	broad.mit.edu	37	22	21341825	21341825	+	Missense_Mutation	SNP	G	A	A			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr22:21341825G>A	uc002zto.2	+	4	456	c.353G>A	c.(352-354)CGT>CAT	p.R118H	LZTR1_uc002ztn.2_Missense_Mutation_p.R77H|LZTR1_uc011ahy.1_Missense_Mutation_p.R99H|LZTR1_uc010gsr.1_5'UTR	NM_006767	NP_006758	Q8N653	LZTR1_HUMAN	leucine-zipper-like transcription regulator 1	118	Kelch 1.				anatomical structure morphogenesis		sequence-specific DNA binding transcription factor activity			ovary(2)|lung(2)	4	all_cancers(11;1.83e-25)|all_epithelial(7;9.19e-23)|Lung NSC(8;3.06e-15)|all_lung(8;5.05e-14)|Melanoma(16;0.000465)|Ovarian(15;0.0028)|Colorectal(54;0.0332)|all_neural(72;0.142)	Lung SC(17;0.0262)	LUSC - Lung squamous cell carcinoma(15;0.000204)|Lung(15;0.00494)|Epithelial(17;0.195)			CCGGCCCCCCGTTACCACCAC	0.662												
39 | TFIP11	24144	broad.mit.edu	37	22	26890269	26890269	+	Missense_Mutation	SNP	A	C	C			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr22:26890269A>C	uc003acr.2	-	13	2368	c.1994T>G	c.(1993-1995)GTG>GGG	p.V665G	TFIP11_uc003acq.2_Missense_Mutation_p.V24G|TFIP11_uc003acs.2_Missense_Mutation_p.V665G|TFIP11_uc003act.2_Missense_Mutation_p.V665G|uc003acu.1_RNA	NM_012143	NP_036275	Q9UBB9	TFP11_HUMAN	tuftelin interacting protein 11	665					biomineral tissue development	catalytic step 2 spliceosome|cytoplasm|nuclear speck	DNA binding|sequence-specific DNA binding transcription factor activity				0						AGAGCACAGCACCTGCCAAAA	0.463												
40 | NEFH	4744	broad.mit.edu	37	22	29886360	29886360	+	Missense_Mutation	SNP	C	A	A			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr22:29886360C>A	uc003afo.2	+	4	2802	c.2731C>A	c.(2731-2733)CCT>ACT	p.P911T	NEFH_uc003afp.2_5'UTR	NM_021076	NP_066554	P12036	NFH_HUMAN	neurofilament, heavy polypeptide 200kDa	917	Tail.				cell death|nervous system development	neurofilament					0						GAAGGAGGCTCCTGCCAAGGT	0.502												
41 | DEPDC5	9681	broad.mit.edu	37	22	32275577	32275577	+	Missense_Mutation	SNP	G	A	A			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr22:32275577G>A	uc003als.2	+	37	3921	c.3779G>A	c.(3778-3780)CGC>CAC	p.R1260H	DEPDC5_uc011als.1_Missense_Mutation_p.R1191H|DEPDC5_uc011alu.1_Missense_Mutation_p.R1291H|DEPDC5_uc011alv.1_RNA|DEPDC5_uc003alt.2_Missense_Mutation_p.R1282H|DEPDC5_uc003alu.2_Missense_Mutation_p.R709H|DEPDC5_uc003alv.2_RNA|DEPDC5_uc003alw.2_Missense_Mutation_p.R558H|DEPDC5_uc011alx.1_Missense_Mutation_p.R108H|DEPDC5_uc010gwk.2_Missense_Mutation_p.R286H|DEPDC5_uc011aly.1_Missense_Mutation_p.R108H	NM_014662	NP_055477	O75140	DEPD5_HUMAN	DEP domain containing 5 isoform 1	1260					intracellular signal transduction					ovary(4)|central_nervous_system(3)|pancreas(1)	8						AGCTTCCAGCGCAAGTGGTTT	0.607												
42 | STXBP5L	9515	broad.mit.edu	37	3	120871386	120871386	+	Silent	SNP	A	G	G			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr3:120871386A>G	uc003eec.3	+	8	872	c.732A>G	c.(730-732)GAA>GAG	p.E244E	STXBP5L_uc011bji.1_Silent_p.E244E	NM_014980	NP_055795	Q9Y2K9	STB5L_HUMAN	syntaxin binding protein 5-like	244	WD 4.				exocytosis|protein transport	cytoplasm|integral to membrane|plasma membrane				ovary(7)|skin(2)	9				GBM - Glioblastoma multiforme(114;0.0694)		AAAGAGCAGAACTGAGAGTTT	0.333												
43 | PEX5L	51555	broad.mit.edu	37	3	179616029	179616029	+	Frame_Shift_Del	DEL	T	-	-			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr3:179616029delT	uc003fki.1	-	3	229	c.99delA	c.(97-99)AAAfs	p.K33fs	PEX5L_uc011bqd.1_5'UTR|PEX5L_uc011bqe.1_Intron|PEX5L_uc011bqf.1_5'UTR|PEX5L_uc003fkj.1_Intron|PEX5L_uc010hxd.1_Frame_Shift_Del_p.K31fs|PEX5L_uc011bqg.1_Frame_Shift_Del_p.K9fs|PEX5L_uc011bqh.1_Intron	NM_016559	NP_057643	Q8IYB4	PEX5R_HUMAN	peroxisomal biogenesis factor 5-like	33					protein import into peroxisome matrix|regulation of cAMP-mediated signaling	cytosol|peroxisomal membrane	peroxisome matrix targeting signal-1 binding			ovary(3)|large_intestine(1)	4	all_cancers(143;3.94e-14)|Ovarian(172;0.0338)|Breast(254;0.183)		OV - Ovarian serous cystadenocarcinoma(80;1.75e-26)|GBM - Glioblastoma multiforme(14;0.000518)			CCCTAGAGCCTTTTCCCTATA	0.413												
44 | C3orf59	151963	broad.mit.edu	37	3	192517421	192517421	+	Missense_Mutation	SNP	T	C	C	rs117555490	by1000genomes	TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr3:192517421T>C	uc011bsp.1	-	2	551	c.230A>G	c.(229-231)GAC>GGC	p.D77G		NM_178496	NP_848591	Q8IYB1	M21D2_HUMAN	hypothetical protein LOC151963	77											0	all_cancers(143;1.56e-08)|Ovarian(172;0.0634)		OV - Ovarian serous cystadenocarcinoma(49;2.8e-18)|LUSC - Lung squamous cell carcinoma(58;8.04e-06)|Lung(62;8.62e-06)	GBM - Glioblastoma multiforme(46;3.86e-05)		AAGCTTTTGGTCCAGCTTTTG	0.443												
45 | NKX3-2	579	broad.mit.edu	37	4	13546023	13546023	+	Missense_Mutation	SNP	C	T	T			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr4:13546023C>T	uc003gmx.2	-	1	92	c.16G>A	c.(16-18)GCC>ACC	p.A6T		NM_001189	NP_001180	P78367	NKX32_HUMAN	NK3 homeobox 2	6					negative regulation of chondrocyte differentiation|transcription from RNA polymerase II promoter	nucleus	sequence-specific DNA binding|sequence-specific DNA binding transcription factor activity				0						AAGGTGTTGGCGCCGCGCACA	0.557												
46 | GEMIN5	25929	broad.mit.edu	37	5	154275813	154275813	+	Missense_Mutation	SNP	G	C	C			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr5:154275813G>C	uc003lvx.3	-	24	3519	c.3436C>G	c.(3436-3438)CAC>GAC	p.H1146D	GEMIN5_uc011ddk.1_Missense_Mutation_p.H1145D	NM_015465	NP_056280	Q8TEQ6	GEMI5_HUMAN	gemin 5	1146					ncRNA metabolic process|protein complex assembly|spliceosomal snRNP assembly	Cajal body|cytosol|spliceosomal complex	protein binding|snRNA binding			skin(2)|ovary(1)	3	Renal(175;0.00488)	Medulloblastoma(196;0.0354)|all_neural(177;0.147)	KIRC - Kidney renal clear cell carcinoma(527;0.00112)			TTCCAAGTGTGGTAAGAGGAG	0.547												
47 | AGXT2L2	85007	broad.mit.edu	37	5	177649920	177649920	+	Missense_Mutation	SNP	C	T	T	rs142142484	byFrequency	TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr5:177649920C>T	uc003miz.2	-	7	886	c.634G>A	c.(634-636)GCT>ACT	p.A212T	AGXT2L2_uc003miy.2_5'UTR|AGXT2L2_uc003mjc.2_Missense_Mutation_p.A171T|AGXT2L2_uc003mja.2_RNA|AGXT2L2_uc003mjb.2_5'UTR|AGXT2L2_uc003mjd.1_Missense_Mutation_p.A70T	NM_153373	NP_699204	Q8IUZ5	AT2L2_HUMAN	alanine-glyoxylate aminotransferase 2-like 2	212						mitochondrion	pyridoxal phosphate binding|transaminase activity			pancreas(1)	1	all_cancers(89;0.00185)|Renal(175;0.000269)|Lung NSC(126;0.00858)|all_lung(126;0.0139)	all_neural(177;0.00802)|Medulloblastoma(196;0.0145)|all_hematologic(541;0.248)	Kidney(164;2.23e-05)|KIRC - Kidney renal clear cell carcinoma(164;0.000178)	GBM - Glioblastoma multiforme(465;0.181)|all cancers(165;0.235)	L-Alanine(DB00160)|Pyridoxal Phosphate(DB00114)	AGAGACTCAGCGAAGAAGGCT	0.587												
48 | BMP6	654	broad.mit.edu	37	6	7727630	7727630	+	Missense_Mutation	SNP	G	A	A			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr6:7727630G>A	uc003mxu.3	+	1	620	c.442G>A	c.(442-444)GAC>AAC	p.D148N		NM_001718	NP_001709	P22004	BMP6_HUMAN	bone morphogenetic protein 6 preproprotein	148					BMP signaling pathway|cartilage development|growth|immune response|positive regulation of aldosterone biosynthetic process|positive regulation of bone mineralization|positive regulation of osteoblast differentiation|positive regulation of pathway-restricted SMAD protein phosphorylation|positive regulation of transcription from RNA polymerase II promoter|SMAD protein signal transduction	extracellular space	BMP receptor binding|cytokine activity|growth factor activity|protein heterodimerization activity			large_intestine(2)|ovary(1)	3	Ovarian(93;0.0721)					CGCCGACAACGACGAGGACGG	0.682												
49 | NFKBIE	4794	broad.mit.edu	37	6	44229437	44229437	+	Missense_Mutation	SNP	C	A	A			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr6:44229437C>A	uc003oxe.1	-	3	1059	c.1034G>T	c.(1033-1035)TGC>TTC	p.C345F		NM_004556	NP_004547	O00221	IKBE_HUMAN	nuclear factor of kappa light polypeptide gene	345	ANK 3.				cytoplasmic sequestering of transcription factor		protein binding			breast(2)	2	all_cancers(18;2e-05)|all_lung(25;0.00747)|Hepatocellular(11;0.00908)|Ovarian(13;0.0273)		Colorectal(64;0.00337)|COAD - Colon adenocarcinoma(64;0.00536)			TTCCAGCAGGCAGCGGGCACA	0.632												
50 | COL19A1	1310	broad.mit.edu	37	6	70589454	70589454	+	Translation_Start_Site	SNP	G	T	T			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr6:70589454G>T	uc003pfc.1	+	2	112	c.-5G>T	c.(-7--3)AAGGC>AATGC			NM_001858	NP_001849	Q14993	COJA1_HUMAN	alpha 1 type XIX collagen precursor						cell differentiation|cell-cell adhesion|extracellular matrix organization|skeletal system development	collagen	extracellular matrix structural constituent|protein binding, bridging			ovary(2)|breast(2)	4						ATGGTTTCAAGGCACAATGAG	0.418												
51 | RFPL4B	442247	broad.mit.edu	37	6	112671523	112671523	+	Missense_Mutation	SNP	C	T	T	rs143103700	byFrequency	TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr6:112671523C>T	uc003pvx.1	+	3	925	c.613C>T	c.(613-615)CGC>TGC	p.R205C		NM_001013734	NP_001013756	Q6ZWI9	RFPLB_HUMAN	ret finger protein-like 4B	205	B30.2/SPRY.						zinc ion binding				0		all_cancers(87;9.44e-05)|all_hematologic(75;0.000114)|all_epithelial(87;0.00265)|Colorectal(196;0.0209)		all cancers(137;0.0202)|OV - Ovarian serous cystadenocarcinoma(136;0.0477)|Epithelial(106;0.0646)|GBM - Glioblastoma multiforme(226;0.0866)|BRCA - Breast invasive adenocarcinoma(108;0.244)		CCCTCGCCTTCGCCGTGTGGG	0.448												
52 | DSE	29940	broad.mit.edu	37	6	116757341	116757341	+	Missense_Mutation	SNP	C	A	A			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr6:116757341C>A	uc003pws.2	+	6	1904	c.1710C>A	c.(1708-1710)GAC>GAA	p.D570E	DSE_uc011ebg.1_Missense_Mutation_p.D589E|DSE_uc003pwt.2_Missense_Mutation_p.D570E|DSE_uc003pwu.2_Missense_Mutation_p.D237E	NM_001080976	NP_001074445	Q9UL01	DSE_HUMAN	dermatan sulfate epimerase precursor	570					dermatan sulfate biosynthetic process	endoplasmic reticulum|Golgi apparatus|integral to membrane	chondroitin-glucuronate 5-epimerase activity			ovary(1)	1		all_cancers(87;0.00019)|all_epithelial(87;0.000416)|Ovarian(999;0.133)|Colorectal(196;0.234)		Epithelial(106;0.00915)|OV - Ovarian serous cystadenocarcinoma(136;0.0149)|GBM - Glioblastoma multiforme(226;0.0189)|all cancers(137;0.0262)		TCCTTGTAGACCAAATACACC	0.502												
53 | CLIP2	7461	broad.mit.edu	37	7	73771699	73771699	+	Silent	SNP	G	A	A			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr7:73771699G>A	uc003uam.2	+	6	1434	c.1107G>A	c.(1105-1107)GAG>GAA	p.E369E	CLIP2_uc003uan.2_Silent_p.E369E	NM_003388	NP_003379	Q9UDT6	CLIP2_HUMAN	CAP-GLY domain containing linker protein 2	369	Potential.					microtubule associated complex				skin(3)	3						AGCACATTGAGCAGCTGCTGG	0.617												
54 | PRUNE2	158471	broad.mit.edu	37	9	79321219	79321219	+	Missense_Mutation	SNP	C	G	G			TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr9:79321219C>G	uc010mpk.2	-	8	6095	c.5971G>C	c.(5971-5973)GAA>CAA	p.E1991Q	PRUNE2_uc004akj.3_5'Flank|PRUNE2_uc010mpl.1_5'Flank	NM_015225	NP_056040	Q8WUY3	PRUN2_HUMAN	prune homolog 2	1991					apoptosis|G1 phase|induction of apoptosis	cytoplasm	metal ion binding|pyrophosphatase activity				0						TCTTGACCTTCATTAGTTGAA	0.423												
55 | DAPK1	1612	broad.mit.edu	37	9	90266587	90266587	+	Missense_Mutation	SNP	C	T	T	rs36214022		TCGA-32-4209-01A-01D-1353-08	TCGA-32-4209-10A-01D-1353-08									Somatic	Phase_I	Capture				Illumina GAIIx	0c30ef40-b943-4281-84d7-8d574882abd4	77169aad-6bd8-4b1b-bb48-c02960d41ea0	g.chr9:90266587C>T	uc004apc.2	+	17	1910	c.1772C>T	c.(1771-1773)CCT>CTT	p.P591L	DAPK1_uc004apd.2_Missense_Mutation_p.P591L|DAPK1_uc011ltg.1_Missense_Mutation_p.P591L|DAPK1_uc011lth.1_Missense_Mutation_p.P328L|DAPK1_uc004apf.1_Missense_Mutation_p.P145L	NM_004938	NP_004929	P53355	DAPK1_HUMAN	death-associated protein kinase 1	591	ANK 7.		P -> L.		apoptosis|induction of apoptosis by extracellular signals|intracellular protein kinase cascade	actin cytoskeleton|cytoplasm	ATP binding|calmodulin binding|protein serine/threonine kinase activity			ovary(1)|breast(1)	2						GGCAACATGCCTATCGTGGTG	0.498									Chronic_Lymphocytic_Leukemia_Familial_Clustering_of			
56 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # STAT115 Homeworks
2 | 
3 | 
4 | 


--------------------------------------------------------------------------------