├── GSE_analysis_microarray.Rmd
├── Info.txt
└── README.md


/GSE_analysis_microarray.Rmd:
--------------------------------------------------------------------------------
---
title: "GSE analysis for microarray data"
author: "Lian Chee Foong"
date: "2/3/2021"
output:
  pdf_document:
    df_print: tibble
    toc: true
    number_sections: true
    highlight: tango
    fig_width: 3
    fig_height: 1.5
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

***

# Introduction

This R script demonstrates how to download GSE data from the NCBI GEO database and how to obtain the differentially expressed genes from that data.

## A little background of GEO

The Gene Expression Omnibus (GEO) is a data repository hosted by the National Center for Biotechnology Information (NCBI). NCBI contains all publicly available nucleotide and protein sequences. Presently, all records in GenBank NCBI are generated from direct submission to the DNA sequence databases by the original authors, who volunteer their records to make the data publicly available or do so as part of the publication process. The NCBI GEO is intended to house different types of expression data, covering all types of sequencing data in both raw and processed formats.

## Example here

Here, GSE63477, a microarray dataset, will be analysed. It contains expression profiles of prostate cancer cells (LNCaP) after treatment with cabazitaxel or docetaxel for 16 hr. You may read the details in NCBI GEO, under the “overall design” section. For a better understanding, it is always good to read the original paper. Link here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi

***
# Using GEOquery to obtain microarray data

First, set the working directory.

The GEOquery package allows you to access the data from GEO.
Depending on your needs, you can download only the processed data and metadata provided by the depositor. In some cases, you may want to download the raw data as well, if it was provided by the depositor.

The function to download a GEO dataset is ‘getGEO’ from the ‘GEOquery’ package. It returns a list, so check how many platforms were used for the GSE data; usually there is only one. We take the first object in the list, so gse is now an ExpressionSet. You can see that it contains assayData, phenoData, featureData etc.

We can have a look at the sample information, gene annotation, and the expression data. This allows us to get a rough idea of the information stored in this ExpressionSet.

```{R}
getwd()
setwd("C:/Users/Lynn/Documents/R_GEOdata")  # change this to your own working directory

### https://sbc.shef.ac.uk/geo_tutorial/tutorial.nb.html#Introduction
#----import the data------------------------------------
library(GEOquery)
my_id <- "GSE63477"
gse <- getGEO(my_id)

## check how many platforms were used
length(gse)
gse <- gse[[1]]
gse

pData(gse)      ## print the sample information
fData(gse)      ## print the gene annotation
exprs(gse)[1,]  ## print the expression data of the first probe
```

# Check the normalisation and scales used

We can print the ‘data_processing’ field of pData to check the normalization method and see whether the data have already been processed. This expression data was RMA normalized and filtered to remove low-expressing genes. RMA stands for Robust Multiarray Average; it is the most common method to determine probeset expression levels for Affymetrix arrays.

The ‘summary’ function can then be used to print the distributions of expression levels. If the data have been log transformed, the values are typically in the range of 0 to 16. Hmm...the values are quite big and go beyond 16. It’s quite weird because RMA should already log2 transform the data at the last step, but well, let’s do it on our own and move to the next step.
For a more careful analysis, we could re-process the raw data of this dataset and apply RMA normalization ourselves, to see if there is any difference.

Anyway, here, let’s perform a log2 transformation. We can then check the summary of the expression levels again and draw a boxplot. We can see that the distributions of the samples are highly similar, which is consistent with the data having been normalised.

```{R}
pData(gse)$data_processing[1]
# For visualisation and statistical analysis, we will inspect the data to
# discover what scale the data are presented in. The methods we will use assume
# the data are on a log2 scale; typically in the range of 0 to 16.

## have a look at the expression values
summary(exprs(gse))
# From this output we clearly see that the values go beyond 16,
# so we need to perform a log2 transformation.
exprs(gse) <- log2(exprs(gse))

# check the summary again
summary(exprs(gse))

boxplot(exprs(gse), outline = FALSE)
```

# Inspect the clinical variables

Now we look into the pData for the elements that we need for the analysis. We want to know the sample name and whether each sample is a treatment or a control. In this dataset, the info is stored in the column 'characteristics_ch1.1'.

We can use the ‘select’ function from ‘dplyr’ to subset the column of interest. It is also useful to rename the column to something shorter and easier.

To make a column of simplified group names for each sample, ‘stringr’ is helpful. Two new columns are created, named “group” and "serum". The function ‘str_detect’ detects the presence of the given words in each sample description, and the row is then filled in accordingly. Which categories you need in the new columns depends entirely on your dataset, so modify these commands for your dataset of interest.
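
As an illustration of the labelling step described above, here is a minimal vectorised sketch in base R that avoids an explicit loop. The `toy_sample` strings are made-up stand-ins shaped like the 'characteristics_ch1.1' values (the full-serum wording in particular is an assumption), and `toy_drug`, `toy_serum` and `toy_group` are hypothetical names, not objects from this analysis.

```r
# Made-up treatment descriptions (assumption: the real strings contain the same keywords)
toy_sample <- c(
  "treatment: CTRL treated in charcoal dextran treated serum",
  "treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum",
  "treatment: 16h docetaxel (1nM) treated in full serum",
  "treatment: CTRL treated in full serum"
)

# Drug: anything that is neither cabazitaxel nor docetaxel is labelled "Con" here
toy_drug <- ifelse(grepl("cabazitaxel", toy_sample), "cabazitaxel",
            ifelse(grepl("docetaxel", toy_sample), "docetaxel", "Con"))

# Serum: NA if neither keyword is found, so unexpected strings are easy to spot
toy_serum <- ifelse(grepl("dextran", toy_sample), "dextran",
             ifelse(grepl("full", toy_sample), "full_serum", NA_character_))

# Group label = drug + one-letter serum suffix, e.g. "Cond" or "docetaxelf"
toy_group <- paste0(toy_drug, ifelse(toy_serum == "dextran", "d", "f"))

data.frame(group = toy_group, serum = toy_serum)
```

The labels follow the same naming scheme as the loop-based version ("Cond", "Conf", "cabazitaxeld", and so on), so either approach could feed the later design matrix.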

```{R}

library(dplyr)
sampleInfo <- pData(gse)
head(sampleInfo)

table(sampleInfo$characteristics_ch1.1)

# Let's pick just those columns that seem to contain factors we might
# need for the analysis.
sampleInfo <- select(sampleInfo, characteristics_ch1.1)

## Optionally, rename to more convenient column names
sampleInfo <- rename(sampleInfo, sample = characteristics_ch1.1)

head(sampleInfo)
dim(sampleInfo)
sampleInfo$sample

library(stringr)
## label each sample by drug (Con / cabazitaxel / docetaxel) and serum (f = full, d = dextran)
sampleInfo$group <- ""
for (i in seq_len(nrow(sampleInfo))) {
  if (str_detect(sampleInfo$sample[i], "CTRL") && str_detect(sampleInfo$sample[i], "full")) {
    sampleInfo$group[i] <- "Conf"
  }

  if (str_detect(sampleInfo$sample[i], "CTRL") && str_detect(sampleInfo$sample[i], "dextran")) {
    sampleInfo$group[i] <- "Cond"
  }

  if (str_detect(sampleInfo$sample[i], "cabazitaxel") && str_detect(sampleInfo$sample[i], "full")) {
    sampleInfo$group[i] <- "cabazitaxelf"
  }

  if (str_detect(sampleInfo$sample[i], "cabazitaxel") && str_detect(sampleInfo$sample[i], "dextran")) {
    sampleInfo$group[i] <- "cabazitaxeld"
  }

  if (str_detect(sampleInfo$sample[i], "docetaxel") && str_detect(sampleInfo$sample[i], "full")) {
    sampleInfo$group[i] <- "docetaxelf"
  }

  if (str_detect(sampleInfo$sample[i], "docetaxel") && str_detect(sampleInfo$sample[i], "dextran")) {
    sampleInfo$group[i] <- "docetaxeld"
  }
}

sampleInfo

sampleInfo$serum <- ""
for (i in seq_len(nrow(sampleInfo))) {
  if (str_detect(sampleInfo$sample[i], "dextran")) {
    sampleInfo$serum[i] <- "dextran"
  }

  if (str_detect(sampleInfo$sample[i], "full")) {
    sampleInfo$serum[i] <- "full_serum"
  }
}

## drop the original long description, keeping only group and serum
sampleInfo <- sampleInfo[,-1]
sampleInfo
```

# Sample clustering and Principal Components Analysis

We can visualize the
correlations between the samples by hierarchical clustering.

The function ‘cor’ calculates the pairwise correlation between all samples, on a scale of -1 to 1, which we then visualise on a heatmap. There are many ways to create heatmaps in R; here I use ‘pheatmap’, whose only required argument is a matrix of numeric values.

We can add more sample info onto the plot to get a better picture of the groups and the clustering. Here, we make use of the 'sampleInfo' data frame that was created earlier; its row names must match the columns of the correlation matrix.

```{R}

library(pheatmap)
## argument use = "complete.obs" stops an error if there are any missing data points

corMatrix <- cor(exprs(gse), use = "complete.obs")
pheatmap(corMatrix)

## Print the rownames of the sample information and check that they match the correlation matrix

rownames(sampleInfo)
colnames(corMatrix)

## If not, force the rownames to match the columns
#rownames(sampleInfo) <- colnames(corMatrix)

pheatmap(corMatrix, annotation_col = sampleInfo)
```

Another way is to use principal component analysis (PCA). Note that the expression matrix has to be transposed, so that the genes are in the columns and the samples are in the rows. ‘prcomp’ treats rows as observations; the other way round, it would run the PCA on the genes instead of the samples and could run out of memory.

Let’s add labels to plot the results. Here, we use the ‘ggplot2’ package, while the ‘ggrepel’ package is used to position the text labels more cleverly so they can be read. Here we can see that the samples are divided into two groups based on the serum treatment types.
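
To make the transposition point concrete, here is a small base R sketch on simulated data. It is only an illustration under stated assumptions: `toy_expr` is a made-up 200 x 6 matrix standing in for the expression matrix, and the shift added to three samples is an artificial "treatment effect".

```r
set.seed(1)

# Toy matrix: 200 genes (rows) x 6 samples (columns), simulated rather than real data
toy_expr <- matrix(rnorm(200 * 6), nrow = 200,
                   dimnames = list(paste0("gene", 1:200), paste0("S", 1:6)))

# Shift 50 genes in samples S4-S6 to mimic a group difference
toy_expr[1:50, 4:6] <- toy_expr[1:50, 4:6] + 3

# prcomp() treats rows as observations, so transpose to put samples in rows
toy_pca <- prcomp(t(toy_expr))

# One row of scores per sample
dim(toy_pca$x)

# Proportion of variance explained by each principal component
toy_var <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(toy_var, 3)

# PC1 scores: S1-S3 and S4-S6 should fall on opposite sides of zero
toy_pca$x[, "PC1"]
```

Reporting the variance explained alongside the PC1/PC2 plot helps judge how much of the structure the two plotted components actually capture.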

```{R}
# make PCA
library(ggplot2)
library(ggrepel)
## MAKE SURE TO TRANSPOSE THE EXPRESSION MATRIX

pca <- prcomp(t(exprs(gse)))

## Join the PCs to the sample information
cbind(sampleInfo, pca$x) %>%
  ggplot(aes(x = PC1, y = PC2, col = group, label = group)) +
  geom_point() +
  geom_text_repel()
```

# Differential expression analysis

In this section, we use the limma package to perform differential expression analysis. Limma stands for “Linear Models for Microarray Data”. Here, we need to tell limma which sample groups we want to compare. I choose sampleInfo$group. A design matrix will be created; this is a matrix of 0s and 1s, with one row for each sample and one column for each sample group.

We can rename the columns so that they are easier to read.

Now, let’s check whether the expression data contain any lowly-expressed genes, as these will affect the quality of the DE analysis. A big problem in a statistical analysis like limma is the occurrence of type 1 statistical errors, also called false positives. One simple way to reduce the possibility of type 1 errors is to do fewer comparisons, by filtering the data. For example, we know that not all genes are expressed in all tissues, and many genes will not be expressed in any sample. As a result, in DGE analysis, it makes sense to remove the genes that are likely not expressed at all.

How one defines a gene as being expressed is quite subjective. Here, I follow the tutorial and set the cut-off at the median of the expression values, which means we consider around 50% of the measurements to be below the expression threshold. A gene is kept if it is expressed in more than 2 samples.

We can see that around half of the genes do not qualify as “expressed” here, which makes sense, because our cut-off is the median value.
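
The filtering rule described above can be sketched on simulated data. This is a toy example, not the GSE63477 data: `toy_expr` is a made-up matrix with gene-specific means, and all `toy_*` names are hypothetical.

```r
set.seed(42)
n_genes <- 1000
n_samples <- 12

# Each simulated gene gets its own mean level, plus a little sample-to-sample noise
toy_gene_mean <- rnorm(n_genes, mean = 5, sd = 2)
toy_expr <- matrix(rnorm(n_genes * n_samples,
                         mean = rep(toy_gene_mean, n_samples), sd = 0.3),
                   nrow = n_genes)

# Cut-off: the median of all expression values
toy_cutoff <- median(toy_expr)

# TRUE or FALSE for whether each gene is "expressed" in each sample
toy_is_expressed <- toy_expr > toy_cutoff

# Keep genes that are expressed in more than 2 samples
toy_keep <- rowSums(toy_is_expressed) > 2
table(toy_keep)

# Subset to the retained genes
toy_filtered <- toy_expr[toy_keep, , drop = FALSE]
dim(toy_filtered)
```

Because the threshold is the overall median, exactly half of the individual measurements lie above it; how many genes survive then depends on how consistently each gene sits above or below that line across samples.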

```{R}
library(limma)
## fix the level order explicitly so the design columns do not depend on the locale
group <- factor(sampleInfo$group,
                levels = c("cabazitaxeld", "cabazitaxelf", "Cond", "Conf",
                           "docetaxeld", "docetaxelf"))
design <- model.matrix(~0 + group)
design

## the column names are a bit ugly, so we will rename (same order as the levels above)
colnames(design) <- c("Cabazitaxeld","Cabazitaxelf","Cond","Conf","Docetaxeld","Docetaxelf")

design

## calculate median expression level
cutoff <- median(exprs(gse))

## TRUE or FALSE for whether each gene is "expressed" in each sample
is_expressed <- exprs(gse) > cutoff

## Identify genes expressed in more than 2 samples

keep <- rowSums(is_expressed) > 2

## check how many genes are removed / retained.
table(keep)

## subset to just those expressed genes
gse <- gse[keep,]
```

Here there is a small extra step to deal with outliers. This has to be done carefully so that the filtered data won't be too biased. We calculate ‘weights’ to define the reliability of each sample. The ‘arrayWeights’ function assigns a score to each sample, with a value of 1 implying equal weight. Samples with a score below 1 are down-weighted, and samples with a score above 1 are up-weighted.


```{R}
# coping with outliers
## calculate relative array weights
aw <- arrayWeights(exprs(gse), design)
aw
```

Now that we have a design matrix, we need to estimate the coefficients. For this design, we will essentially average the replicate arrays for each sample group. In addition, we will calculate standard deviations for each gene, and the average intensity for the genes across all microarrays.

We are ready to tell limma which pairwise contrasts we want to make. For this experiment, we are going to contrast treatment (there are two types of taxane drugs) and control in each serum type. So there are 4 contrasts to specify.

For the statistical comparisons, limma uses an empirical Bayes method that moderates the gene-wise variance estimates, which makes the tests more stable when there are only a few replicates.
The ‘eBayes’ function performs the tests. To summarize the results of the statistical tests, 'topTable' adjusts the p-values and returns the top genes that meet the cutoffs that you supply as arguments, while 'decideTests' makes calls for DEGs by adjusting the p-values and applying a logFC cutoff, similar to topTable.

```{R}
## Fitting the coefficients
fit <- lmFit(exprs(gse), design,
             weights = aw)

head(fit$coefficients)

## Making comparisons between samples, can define multiple contrasts
contrasts <- makeContrasts(Docetaxeld - Cond, Cabazitaxeld - Cond, Docetaxelf - Conf, Cabazitaxelf - Conf, levels = design)

fit2 <- contrasts.fit(fit, contrasts)
fit2 <- eBayes(fit2)


## with no coef given, all contrasts are tested together (F-statistic)
topTable(fit2)
## one table per contrast, in the order given to makeContrasts
topTable1 <- topTable(fit2, coef=1)
topTable2 <- topTable(fit2, coef=2)
topTable3 <- topTable(fit2, coef=3)
topTable4 <- topTable(fit2, coef=4)

# if we want to know how many genes are differentially expressed overall, we can use the decideTests function.
summary(decideTests(fit2))
table(decideTests(fit2))
```

# Further visualization with gene annotation

Now we want to know the gene name associated with each gene ID. The annotation data can be retrieved with the ‘fData’ function. Let’s select the columns ID and GB_ACC (the GenBank accession ID) and add them to the fit2 object.

The volcano plot is a common way of visualising the results of a DE analysis. The x axis shows the log fold change and the y axis is some measure of statistical significance, which in this case is the log-odds, or “B” statistic. We can also colour the genes that pass both cutoffs: a p-value below 0.05 and an absolute log fold change above 1.
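
The colouring rule for the volcano plot can be sketched on a small made-up table in base R. The numbers below are invented purely for illustration, `toy_results` only mimics the column names of a topTable result, and `p_cut` / `fc_cut` are hypothetical names for the two thresholds.

```r
# Invented values, only to show the logic; not results from this dataset
toy_results <- data.frame(
  logFC   = c(2.1, -1.4, 0.3, 1.2),
  P.Value = c(0.001, 0.02, 0.0005, 0.2)
)

p_cut  <- 0.05
fc_cut <- 1

# Benjamini-Hochberg adjustment, the default method used by limma's topTable()
toy_results$adj.P.Val <- p.adjust(toy_results$P.Value, method = "BH")

# A gene is flagged only if BOTH conditions hold, hence the single '&'
toy_results$Significant <- toy_results$P.Value < p_cut &
  abs(toy_results$logFC) > fc_cut

toy_results
```

Swapping `P.Value` for `adj.P.Val` in the last condition would give a stricter call that accounts for multiple testing; which one to use is a judgment call for the analysis at hand.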

```{R}

anno <- fData(gse)
head(anno)

anno <- select(anno, ID, GB_ACC)
fit2$genes <- anno

topTable(fit2)

## Create volcano plot
full_results1 <- topTable(fit2, coef=1, number=Inf)
library(ggplot2)
ggplot(full_results1, aes(x = logFC, y = B)) + geom_point()

## change according to your needs
p_cutoff <- 0.05
fc_cutoff <- 1


## flag genes that pass both the p-value and the fold-change cutoff
full_results1 %>%
  mutate(Significant = P.Value < p_cutoff & abs(logFC) > fc_cutoff) %>%
  ggplot(aes(x = logFC, y = B, col = Significant)) + geom_point()
```

# Further visualization of selected gene

I think at this point, we are quite clear about the structure of GSE data. It has the sample data, pData; the expression data, exprs; and the annotation data, fData. We have also learned how to check the expression data, normalize them, and perform differential expression analysis.

Now, with the tables of differentially expressed genes, there are some downstream analyses that we can continue with, such as exporting a full table of DE genes, generating a heatmap for your selected genes, getting the gene list for a particular pathway, or survival analysis (but this is only for clinical data).

Here, I just want to look into the fold change data of a selected gene, to see whether it is significantly differentially expressed or not.
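
When there are several contrasts, looking up one gene in each result table can be wrapped in a small helper. The sketch below is an assumption-laden illustration: `lookup_gene` is a hypothetical function, and `toy_tables` holds invented numbers that only mimic the shape of the per-contrast result tables.

```r
# Invented result tables, one per contrast; the values are not from this dataset
toy_tables <- list(
  Docetaxeld_vs_Cond = data.frame(GB_ACC = c("NM_006167", "NR_046018"),
                                  logFC = c(-0.8, 0.1),
                                  adj.P.Val = c(0.01, 0.90)),
  Cabazitaxeld_vs_Cond = data.frame(GB_ACC = c("NR_046018", "NM_006167"),
                                    logFC = c(0.2, -1.1),
                                    adj.P.Val = c(0.80, 0.003))
)

# Hypothetical helper: collect the rows for one accession from every table
lookup_gene <- function(tables, accession) {
  rows <- lapply(names(tables), function(nm) {
    tab <- tables[[nm]]
    hit <- tab[tab$GB_ACC %in% accession, , drop = FALSE]  # %in% is NA-safe
    data.frame(contrast = rep(nm, nrow(hit)), hit, row.names = NULL)
  })
  do.call(rbind, rows)
}

lookup_gene(toy_tables, "NM_006167")
```

The result has one row per contrast in which the accession was found, which makes it easy to compare the fold changes side by side.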

```{R}

## Get the results for a particular gene of interest
# GB_ACC for Nkx3-1 is NM_001256339 or NM_006167
## there is no NM_001256339 in this data
full_results2 <- topTable(fit2, coef=2, number=Inf)
full_results3 <- topTable(fit2, coef=3, number=Inf)
full_results4 <- topTable(fit2, coef=4, number=Inf)
filter(full_results1, GB_ACC == "NM_006167")
filter(full_results2, GB_ACC == "NM_006167")
filter(full_results3, GB_ACC == "NM_006167")
filter(full_results4, GB_ACC == "NM_006167")
```

That’s all for the walk-through. Thanks for reading; I hope you have learned something new here.

# Acknowledgement
Many thanks to the following tutorials made publicly available:

1. Introduction to microarray analysis GSE15947, by Department of Statistics, Purdue University https://www.stat.purdue.edu/bigtap/online/docs/Introduction_to_Microarray_Analysis_GSE15947.html

2. Mark Dunning, 2020, GEO tutorial, by Sheffield Bioinformatics Core https://sbc.shef.ac.uk/geo_tutorial/tutorial.nb.html#Further_processing_and_visualisation_of_DE_results


--------------------------------------------------------------------------------
/Info.txt:
--------------------------------------------------------------------------------
# How-to-analyze-GEO-microarray-data
GSE analysis for microarray data, for the tutorial as shown in https://www.youtube.com/watch?v=JQ24T9fpXvg&t=947s

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
-----

# Introduction

This R script demonstrates how to download GSE data from the NCBI GEO
database and how to obtain the differentially expressed genes from that
data.

## A little background of GEO

The Gene Expression Omnibus (GEO) is a data repository hosted by the
National Center for Biotechnology Information (NCBI). NCBI contains all
publicly available nucleotide and protein sequences. Presently, all
records in GenBank NCBI are generated from direct submission to the DNA
sequence databases by the original authors, who volunteer their
records to make the data publicly available or do so as part of the
publication process. The NCBI GEO is intended to house different types
of expression data, covering all types of sequencing data in both raw and
processed formats.

## Example here

Here, GSE63477, a microarray dataset, will be analysed. It contains
expression profiles of prostate cancer cells (LNCaP) after treatment
with cabazitaxel or docetaxel for 16 hr. You may read the details in
NCBI GEO, under the “overall design” section. For a better
understanding, it is always good to read the original paper. Link here:


-----

# Using GEOquery to obtain microarray data

First, set the working directory.

The GEOquery package allows you to access the data from GEO. Depending
on your needs, you can download only the processed data and metadata
provided by the depositor. In some cases, you may want to download the
raw data as well, if it was provided by the depositor.

The function to download a GEO dataset is ‘getGEO’ from the ‘GEOquery’
package. It returns a list, so check how many platforms were used for
the GSE data; usually there is only one. We take the first object in
the list, so gse is now an ExpressionSet. You can see that it contains
assayData, phenoData, featureData etc.

We can have a look at the sample information, gene annotation, and the
expression data. This allows us to get a rough idea of the information
stored in this ExpressionSet.
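
For a series measured on more than one platform, taking the first list element may not be what you want. The sketch below shows one way to pick an element by platform ID. It is only an illustration: `pick_platform` is a hypothetical helper, `toy_list` is a plain list of strings standing in for the list of ExpressionSets, and the element names assume a file-name pattern that contains the GPL identifier.

``` r
# Hypothetical helper: return the single list element whose name mentions the platform
pick_platform <- function(gse_list, platform = NULL) {
  # With one element (or no platform requested) fall back to the first element
  if (length(gse_list) == 1 || is.null(platform)) {
    return(gse_list[[1]])
  }
  idx <- grep(platform, names(gse_list), fixed = TRUE)
  if (length(idx) != 1) {
    stop("Expected exactly one element matching platform ", platform)
  }
  gse_list[[idx]]
}

# Toy stand-in for a multi-platform result (assumed naming pattern, made-up accession)
toy_list <- list(
  "GSE0000-GPL96_series_matrix.txt.gz" = "set on GPL96",
  "GSE0000-GPL97_series_matrix.txt.gz" = "set on GPL97"
)

pick_platform(toy_list, "GPL97")
```

For GSE63477 itself there is only one platform, so `gse[[1]]` is sufficient.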

``` r
getwd()
```

    ## [1] "C:/Users/Lynn/Documents/R_GEOdata"

``` r
setwd("C:/Users/Lynn/Documents/R_GEOdata")

###https://sbc.shef.ac.uk/geo_tutorial/tutorial.nb.html#Introduction
#----import the data------------------------------------
library(GEOquery)
```

    ## Loading required package: Biobase

    ## Loading required package: BiocGenerics

    ## Loading required package: parallel

    ## 
    ## Attaching package: 'BiocGenerics'

    ## The following objects are masked from 'package:parallel':
    ## 
    ##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    ##     clusterExport, clusterMap, parApply, parCapply, parLapply,
    ##     parLapplyLB, parRapply, parSapply, parSapplyLB

    ## The following objects are masked from 'package:stats':
    ## 
    ##     IQR, mad, sd, var, xtabs

    ## The following objects are masked from 'package:base':
    ## 
    ##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    ##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    ##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    ##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    ##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    ##     union, unique, unsplit, which.max, which.min

    ## Welcome to Bioconductor
    ## 
    ##     Vignettes contain introductory material; view with
    ##     'browseVignettes()'. To cite Bioconductor, see
    ##     'citation("Biobase")', and for packages 'citation("pkgname")'.

    ## Setting options('download.file.method.GEOquery'='auto')

    ## Setting options('GEOquery.inmemory.gpl'=FALSE)

``` r
my_id <- "GSE63477"
gse <- getGEO(my_id)
```

    ## Found 1 file(s)

    ## GSE63477_series_matrix.txt.gz

    ## Parsed with column specification:
    ## cols(
    ##   ID_REF = col_double(),
    ##   GSM1550559 = col_double(),
    ##   GSM1550560 = col_double(),
    ##   GSM1550561 = col_double(),
    ##   GSM1550562 = col_double(),
    ##   GSM1550563 = col_double(),
    ##   GSM1550564 = col_double(),
    ##   GSM1550565 = col_double(),
    ##   GSM1550566 = col_double(),
    ##   GSM1550567 = col_double(),
    ##   GSM1550568 = col_double(),
    ##   GSM1550569 = col_double(),
    ##   GSM1550570 = col_double()
    ## )

    ## File stored at:

    ## C:\Users\Lynn\AppData\Local\Temp\RtmpcbmFsU/GPL16686.soft

    ## Warning: 190 parsing failures.
    ##   row col expected         actual         file
    ## 53792  ID a double AFFX-BioB-3_at literal data
    ## 53793  ID a double AFFX-BioB-3_st literal data
    ## 53794  ID a double AFFX-BioB-5_at literal data
    ## 53795  ID a double AFFX-BioB-5_st literal data
    ## 53796  ID a double AFFX-BioB-M_at literal data
    ## ..... ... ........ .............. ............
    ## See problems(...) for more details.

``` r
## check how many platforms used
length(gse)
```

    ## [1] 1

``` r
gse <-gse[[1]]
gse
```

    ## ExpressionSet (storageMode: lockedEnvironment)
    ## assayData: 44629 features, 12 samples 
    ##   element names: exprs 
    ## protocolData: none
    ## phenoData
    ##   sampleNames: GSM1550559 GSM1550560 ... GSM1550570 (12 total)
    ##   varLabels: title geo_accession ... treatment:ch1 (35 total)
    ##   varMetadata: labelDescription
    ## featureData
    ##   featureNames: 16657436 16657440 ... 17118478 (44629 total)
    ##   fvarLabels: ID RANGE_STRAND ... RANGE_GB (8 total)
    ##   fvarMetadata: Column Description labelDescription
    ## experimentData: use 'experimentData(object)'
    ## Annotation: GPL16686

``` r
pData(gse)[1:2,] ## print the sample information
```

    ##                                                                            title
    ## GSM1550559 LNCaP CTRL treated in charcoal dextran treated serum, replicate 1
    ## GSM1550560 LNCaP CTRL treated in charcoal dextran treated serum, replicate 2
    ##            geo_accession status                submission_date last_update_date
    ## GSM1550559 GSM1550559    Public on Jan 01 2015 Nov 19 2014     Jan 01 2015
    ## GSM1550560 GSM1550560    Public on Jan 01 2015 Nov 19 2014     Jan 01 2015
    ##            type channel_count source_name_ch1 organism_ch1 characteristics_ch1
    ## GSM1550559 RNA  1             LNCaP           Homo sapiens cell line: LNCaP
    ## GSM1550560 RNA  1             LNCaP           Homo sapiens cell line: LNCaP
    ##            characteristics_ch1.1
    ## GSM1550559 treatment: CTRL treated in charcoal dextran treated serum
    ## GSM1550560 treatment: CTRL treated in charcoal dextran treated serum
    ##            treatment_protocol_ch1
    ## GSM1550559 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH)
    ## GSM1550560 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH)
    ##            growth_protocol_ch1                   molecule_ch1
    ## GSM1550559 Cells treated at ca. 80% confluency   total RNA
    ## GSM1550560 Cells treated at ca. 80% confluency   total RNA
    ##            extract_protocol_ch1
    ## GSM1550559 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc)
    ## GSM1550560 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc)
    ##            label_ch1
    ## GSM1550559 biotin
    ## GSM1550560 biotin
    ##            label_protocol_ch1
    ## GSM1550559 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen Technologies, Inc)
    ## GSM1550560 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen Technologies, Inc)
    ##            taxid_ch1
    ## GSM1550559 9606
    ## GSM1550560 9606
    ##            hyb_protocol
    ## GSM1550559 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h.
    ## GSM1550560 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h.
    ##            scan_protocol
    ## GSM1550559 Affymetrix Gene ChIP Scanner 3000 7G
    ## GSM1550560 Affymetrix Gene ChIP Scanner 3000 7G
    ##            data_processing
    ## GSM1550559 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes.
    ## GSM1550560 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes.
    ##            data_processing.1 data_processing.2 platform_id contact_name
    ## GSM1550559 HuGene-2_0-st.pgf HuGene-2_0-st.mps GPL16686    Karen,,Knudsen
    ## GSM1550560 HuGene-2_0-st.pgf HuGene-2_0-st.mps GPL16686    Karen,,Knudsen
    ##            contact_institute
    ## GSM1550559 Thomas Jefferson University - Kimmel Cancer Center
    ## GSM1550560 Thomas Jefferson University - Kimmel Cancer Center
    ##            contact_address          contact_city contact_state
    ## GSM1550559 233 S 10th St, BLSB 1008 Philadelphia Pennsylvania
    ## GSM1550560 233 S 10th St, BLSB 1008 Philadelphia Pennsylvania
    ##            contact_zip/postal_code contact_country
    ## GSM1550559 19107                   USA
    ## GSM1550560 19107                   USA
    ##            supplementary_file
    ## GSM1550559 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550559/suppl/GSM1550559_01_LN_CDT-CTRL_1.CEL.gz
    ## GSM1550560 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550560/suppl/GSM1550560_01_LN_CTS-CTRL_2.CEL.gz
    ##            data_row_count cell line:ch1
    ## GSM1550559 44629          LNCaP
    ## GSM1550560 44629          LNCaP
    ##            treatment:ch1
    ## GSM1550559 CTRL treated in charcoal dextran treated serum
    ## GSM1550560 CTRL treated in charcoal dextran treated serum

``` r
fData(gse)[1,] ## print the gene annotation
```

    ##                ID RANGE_STRAND RANGE_START RANGE_END total_probes    GB_ACC
    ## 16657436 16657436            +       12190     13639           25 NR_046018
    ##                   SPOT_ID     RANGE_GB
    ## 16657436 chr1:12190-13639 NC_000001.10

``` r
exprs(gse)[1,] ## print the expression data
```

    ## GSM1550559 GSM1550560 GSM1550561 GSM1550562 GSM1550563 GSM1550564 GSM1550565 
    ##   24.63215   21.96198   24.36674   24.28032   24.82574   22.72258   25.76430 
    ## GSM1550566 GSM1550567 GSM1550568 GSM1550569 GSM1550570 
    ##   23.03947   24.45452   23.74868   23.77131   21.67176

# Check the normalisation and scales used

We can print the ‘data\_processing’ field of pData to check the normalization method, to
see if the 256 | data has already processed. So this expression data was RMA normalized 257 | and filtered to remove low-expressing genes. RMA means Robust Multiarray 258 | Average, it is the most common method to determine probeset expression 259 | level for Affymetrix arrays. 260 | 261 | The ‘summary’ function can then be used to print the distributions of 262 | expression levels, if the data has been log transformed, typically in 263 | the range of 0 to 16. Hmm…the values are quite big and go beyond 16. 264 | It’s quite weird because RMA should already log2 transform the data at 265 | the last step, but well, let’s do it on our own and move to the next 266 | step. For a more careful analysis, we can try to run the raw data of 267 | this dataset again, by applying RMA normalization on our own, to see if 268 | there is any difference. 269 | 270 | Anyway, here, let’s perform a log2 transformation. We may check the 271 | summary of expression level again. And draw a boxplot. We can see that 272 | the distributions of each sample are highly similar, which means the 273 | data have been normalised. 274 | 275 | ``` r 276 | pData(gse)$data_processing[1] 277 | ``` 278 | 279 | ## [1] "Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes." 280 | 281 | ``` r 282 | # For visualisation and statistical analysis, we will inspect the data to 283 | # discover what scale the data are presented in. The methods we will use assume 284 | # the data are on a log2 scale; typically in the range of 0 to 16. 285 | 286 | ## have a look on the expression value 287 | summary(exprs(gse)) 288 | ``` 289 | 290 | ## GSM1550559 GSM1550560 GSM1550561 GSM1550562 291 | ## Min. : 17.42 Min. : 17.49 Min. : 17.54 Min. 
: 17.44 292 | ## 1st Qu.: 19.59 1st Qu.: 19.53 1st Qu.: 19.59 1st Qu.: 19.45 293 | ## Median : 22.29 Median : 22.34 Median : 22.39 Median : 22.43 294 | ## Mean : 48.09 Mean : 48.51 Mean : 48.08 Mean : 48.91 295 | ## 3rd Qu.: 33.45 3rd Qu.: 33.90 3rd Qu.: 33.70 3rd Qu.: 33.92 296 | ## Max. :5889.83 Max. :6043.74 Max. :5907.55 Max. :6087.95 297 | ## GSM1550563 GSM1550564 GSM1550565 GSM1550566 298 | ## Min. : 17.48 Min. : 17.47 Min. : 17.43 Min. : 17.49 299 | ## 1st Qu.: 19.48 1st Qu.: 19.51 1st Qu.: 19.59 1st Qu.: 19.54 300 | ## Median : 22.19 Median : 22.22 Median : 22.38 Median : 22.29 301 | ## Mean : 49.16 Mean : 48.72 Mean : 48.29 Mean : 48.83 302 | ## 3rd Qu.: 33.70 3rd Qu.: 33.84 3rd Qu.: 33.49 3rd Qu.: 33.85 303 | ## Max. :5991.22 Max. :5827.09 Max. :5704.80 Max. :5938.88 304 | ## GSM1550567 GSM1550568 GSM1550569 GSM1550570 305 | ## Min. : 17.49 Min. : 17.32 Min. : 17.38 Min. : 17.38 306 | ## 1st Qu.: 19.54 1st Qu.: 19.58 1st Qu.: 19.41 1st Qu.: 19.52 307 | ## Median : 22.36 Median : 22.35 Median : 22.31 Median : 22.27 308 | ## Mean : 48.72 Mean : 48.31 Mean : 49.31 Mean : 48.94 309 | ## 3rd Qu.: 33.62 3rd Qu.: 33.58 3rd Qu.: 33.82 3rd Qu.: 33.71 310 | ## Max. :6140.01 Max. :6066.52 Max. :6307.27 Max. :5844.71 311 | 312 | ``` r 313 | # From this output we clearly see that the values go beyond 16, 314 | # so we need to perform a log2 transformation. 315 | exprs(gse) <- log2(exprs(gse)) 316 | 317 | # check again the summary 318 | summary(exprs(gse)) 319 | ``` 320 | 321 | ## GSM1550559 GSM1550560 GSM1550561 GSM1550562 322 | ## Min. : 4.122 Min. : 4.128 Min. : 4.133 Min. : 4.124 323 | ## 1st Qu.: 4.292 1st Qu.: 4.288 1st Qu.: 4.292 1st Qu.: 4.282 324 | ## Median : 4.479 Median : 4.482 Median : 4.485 Median : 4.488 325 | ## Mean : 4.887 Mean : 4.892 Mean : 4.887 Mean : 4.890 326 | ## 3rd Qu.: 5.064 3rd Qu.: 5.083 3rd Qu.: 5.075 3rd Qu.: 5.084 327 | ## Max. :12.524 Max. :12.561 Max. :12.528 Max. 
:12.572 328 | ## GSM1550563 GSM1550564 GSM1550565 GSM1550566 329 | ## Min. : 4.128 Min. : 4.127 Min. : 4.123 Min. : 4.128 330 | ## 1st Qu.: 4.284 1st Qu.: 4.286 1st Qu.: 4.292 1st Qu.: 4.288 331 | ## Median : 4.472 Median : 4.474 Median : 4.484 Median : 4.478 332 | ## Mean : 4.893 Mean : 4.892 Mean : 4.887 Mean : 4.892 333 | ## 3rd Qu.: 5.075 3rd Qu.: 5.081 3rd Qu.: 5.066 3rd Qu.: 5.081 334 | ## Max. :12.549 Max. :12.509 Max. :12.478 Max. :12.536 335 | ## GSM1550567 GSM1550568 GSM1550569 GSM1550570 336 | ## Min. : 4.128 Min. : 4.114 Min. : 4.119 Min. : 4.120 337 | ## 1st Qu.: 4.288 1st Qu.: 4.291 1st Qu.: 4.279 1st Qu.: 4.287 338 | ## Median : 4.483 Median : 4.482 Median : 4.479 Median : 4.477 339 | ## Mean : 4.891 Mean : 4.888 Mean : 4.891 Mean : 4.892 340 | ## 3rd Qu.: 5.071 3rd Qu.: 5.070 3rd Qu.: 5.080 3rd Qu.: 5.075 341 | ## Max. :12.584 Max. :12.567 Max. :12.623 Max. :12.513 342 | 343 | ``` r 344 | boxplot(exprs(gse),outline=FALSE) 345 | ``` 346 | 347 | ![](GSE_analysis_microarray_files/figure-gfm/unnamed-chunk-2-1.png) 348 | 349 | # Inspect the clinical variables 350 | 351 | Now we look into the pData for the elements that we need for the 352 | analysis. We want to know the sample name and whether it is treatment or 353 | control; in this dataset, the information is stored in the column 354 | ‘characteristics\_ch1.1’. 355 | 356 | We can use the ‘select’ function from dplyr to subset the column of interest. It is 357 | also useful to rename the column to something shorter and 358 | easier to type. 359 | 360 | To make a column of simplified group names for each sample, the ‘stringr’ package is 361 | helpful. Two new columns are created, named “group” and “serum”. The 362 | function ‘str\_detect’ detects the presence of a keyword in each sample description, and the 363 | row is filled in accordingly. The categories needed in the new columns depend entirely on your dataset, 364 | so modify these 365 | commands for your dataset of interest.
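
The keyword matching described above can also be written without a loop. The following is a minimal sketch in base R, assuming a toy character vector `sample_desc` (a hypothetical stand-in for `sampleInfo$sample`) and the same label scheme used in this walkthrough: the drug name or `Con`, followed by `d` or `f` for the serum type. It illustrates the idea under those assumptions and is not the original author's code.

```r
# Toy stand-in for sampleInfo$sample (hypothetical; the real strings come from pData(gse))
sample_desc <- c(
  "treatment: CTRL treated in charcoal dextran treated serum",
  "treatment: 16h cabazitaxel (1nM) treated in full serum",
  "treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum",
  "treatment: CTRL treated in full serum"
)

# Drug keyword; anything that matches neither drug is assumed to be a control
drug <- ifelse(grepl("cabazitaxel", sample_desc), "cabazitaxel",
        ifelse(grepl("docetaxel", sample_desc), "docetaxel", "Con"))

# Serum type; anything that is not "full" is assumed to be charcoal dextran treated
serum <- ifelse(grepl("full", sample_desc), "full_serum", "dextran")

# Group label = drug + first letter of the serum type ("d" or "f")
group <- paste0(drug, substr(serum, 1, 1))

labelled <- data.frame(group = group, serum = serum)
print(labelled)
```

Because `grepl` is vectorised, the whole column is labelled in one pass. The fall-through defaults are assumptions, so it is worth checking the result with `table(labelled$group)` before using it in a design matrix.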
366 | 367 | ``` r 368 | library(dplyr) 369 | ``` 370 | 371 | ## 372 | ## Attaching package: 'dplyr' 373 | 374 | ## The following object is masked from 'package:Biobase': 375 | ## 376 | ## combine 377 | 378 | ## The following objects are masked from 'package:BiocGenerics': 379 | ## 380 | ## combine, intersect, setdiff, union 381 | 382 | ## The following objects are masked from 'package:stats': 383 | ## 384 | ## filter, lag 385 | 386 | ## The following objects are masked from 'package:base': 387 | ## 388 | ## intersect, setdiff, setequal, union 389 | 390 | ``` r 391 | sampleInfo <- pData(gse) 392 | head(sampleInfo) 393 | ``` 394 | 395 | ## title 396 | ## GSM1550559 LNCaP CTRL treated in charcoal dextran treated serum, replicate 1 397 | ## GSM1550560 LNCaP CTRL treated in charcoal dextran treated serum, replicate 2 398 | ## GSM1550561 LNCaP 16h cabazitaxel (1nM) treated in charcoal dextran treated serum, replicate 1 399 | ## GSM1550562 LNCaP 16h cabazitaxel (1nM) treated in charcoal dextran treated serum, replicate 2 400 | ## GSM1550563 LNCaP 16h docetaxel (1nM) treated in charcoal dextran treated serum, replicate 1 401 | ## GSM1550564 LNCaP 16h docetaxel (1nM) treated in charcoal dextran treated serum, replicate 2 402 | ## geo_accession status submission_date last_update_date 403 | ## GSM1550559 GSM1550559 Public on Jan 01 2015 Nov 19 2014 Jan 01 2015 404 | ## GSM1550560 GSM1550560 Public on Jan 01 2015 Nov 19 2014 Jan 01 2015 405 | ## GSM1550561 GSM1550561 Public on Jan 01 2015 Nov 19 2014 Jan 01 2015 406 | ## GSM1550562 GSM1550562 Public on Jan 01 2015 Nov 19 2014 Jan 01 2015 407 | ## GSM1550563 GSM1550563 Public on Jan 01 2015 Nov 19 2014 Jan 01 2015 408 | ## GSM1550564 GSM1550564 Public on Jan 01 2015 Nov 19 2014 Jan 01 2015 409 | ## type channel_count source_name_ch1 organism_ch1 characteristics_ch1 410 | ## GSM1550559 RNA 1 LNCaP Homo sapiens cell line: LNCaP 411 | ## GSM1550560 RNA 1 LNCaP Homo sapiens cell line: LNCaP 412 | ## GSM1550561 RNA 1 LNCaP Homo 
sapiens cell line: LNCaP 413 | ## GSM1550562 RNA 1 LNCaP Homo sapiens cell line: LNCaP 414 | ## GSM1550563 RNA 1 LNCaP Homo sapiens cell line: LNCaP 415 | ## GSM1550564 RNA 1 LNCaP Homo sapiens cell line: LNCaP 416 | ## characteristics_ch1.1 417 | ## GSM1550559 treatment: CTRL treated in charcoal dextran treated serum 418 | ## GSM1550560 treatment: CTRL treated in charcoal dextran treated serum 419 | ## GSM1550561 treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum 420 | ## GSM1550562 treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum 421 | ## GSM1550563 treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum 422 | ## GSM1550564 treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum 423 | ## treatment_protocol_ch1 424 | ## GSM1550559 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH) 425 | ## GSM1550560 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH) 426 | ## GSM1550561 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH) 427 | ## GSM1550562 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH) 428 | ## GSM1550563 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH) 429 | ## GSM1550564 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH) 430 | ## growth_protocol_ch1 molecule_ch1 431 | ## GSM1550559 Cells treated at ca. 80% confluency total RNA 432 | ## GSM1550560 Cells treated at ca. 80% confluency total RNA 433 | ## GSM1550561 Cells treated at ca. 80% confluency total RNA 434 | ## GSM1550562 Cells treated at ca. 80% confluency total RNA 435 | ## GSM1550563 Cells treated at ca. 80% confluency total RNA 436 | ## GSM1550564 Cells treated at ca. 
80% confluency total RNA 437 | ## extract_protocol_ch1 438 | ## GSM1550559 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc) 439 | ## GSM1550560 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc) 440 | ## GSM1550561 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc) 441 | ## GSM1550562 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc) 442 | ## GSM1550563 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc) 443 | ## GSM1550564 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc) 444 | ## label_ch1 445 | ## GSM1550559 biotin 446 | ## GSM1550560 biotin 447 | ## GSM1550561 biotin 448 | ## GSM1550562 biotin 449 | ## GSM1550563 biotin 450 | ## GSM1550564 biotin 451 | ## label_protocol_ch1 452 | ## GSM1550559 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen Technologies, Inc) 453 | ## GSM1550560 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen Technologies, Inc) 454 | ## GSM1550561 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen Technologies, Inc) 455 | ## GSM1550562 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen Technologies, Inc) 456 | ## GSM1550563 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen 
Technologies, Inc) 457 | ## GSM1550564 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen Technologies, Inc) 458 | ## taxid_ch1 459 | ## GSM1550559 9606 460 | ## GSM1550560 9606 461 | ## GSM1550561 9606 462 | ## GSM1550562 9606 463 | ## GSM1550563 9606 464 | ## GSM1550564 9606 465 | ## hyb_protocol 466 | ## GSM1550559 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h. 467 | ## GSM1550560 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h. 468 | ## GSM1550561 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h. 469 | ## GSM1550562 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h. 470 | ## GSM1550563 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h. 471 | ## GSM1550564 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h. 
472 | ## scan_protocol 473 | ## GSM1550559 Affymetrix Gene ChIP Scanner 3000 7G 474 | ## GSM1550560 Affymetrix Gene ChIP Scanner 3000 7G 475 | ## GSM1550561 Affymetrix Gene ChIP Scanner 3000 7G 476 | ## GSM1550562 Affymetrix Gene ChIP Scanner 3000 7G 477 | ## GSM1550563 Affymetrix Gene ChIP Scanner 3000 7G 478 | ## GSM1550564 Affymetrix Gene ChIP Scanner 3000 7G 479 | ## data_processing 480 | ## GSM1550559 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes. 481 | ## GSM1550560 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes. 482 | ## GSM1550561 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes. 483 | ## GSM1550562 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes. 484 | ## GSM1550563 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes. 485 | ## GSM1550564 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes. 
486 | ## data_processing.1 data_processing.2 platform_id contact_name 487 | ## GSM1550559 HuGene-2_0-st.pgf HuGene-2_0-st.mps GPL16686 Karen,,Knudsen 488 | ## GSM1550560 HuGene-2_0-st.pgf HuGene-2_0-st.mps GPL16686 Karen,,Knudsen 489 | ## GSM1550561 HuGene-2_0-st.pgf HuGene-2_0-st.mps GPL16686 Karen,,Knudsen 490 | ## GSM1550562 HuGene-2_0-st.pgf HuGene-2_0-st.mps GPL16686 Karen,,Knudsen 491 | ## GSM1550563 HuGene-2_0-st.pgf HuGene-2_0-st.mps GPL16686 Karen,,Knudsen 492 | ## GSM1550564 HuGene-2_0-st.pgf HuGene-2_0-st.mps GPL16686 Karen,,Knudsen 493 | ## contact_institute 494 | ## GSM1550559 Thomas Jefferson University - Kimmel Cancer Center 495 | ## GSM1550560 Thomas Jefferson University - Kimmel Cancer Center 496 | ## GSM1550561 Thomas Jefferson University - Kimmel Cancer Center 497 | ## GSM1550562 Thomas Jefferson University - Kimmel Cancer Center 498 | ## GSM1550563 Thomas Jefferson University - Kimmel Cancer Center 499 | ## GSM1550564 Thomas Jefferson University - Kimmel Cancer Center 500 | ## contact_address contact_city contact_state 501 | ## GSM1550559 233 S 10th St, BLSB 1008 Philadelphia Pennsylvania 502 | ## GSM1550560 233 S 10th St, BLSB 1008 Philadelphia Pennsylvania 503 | ## GSM1550561 233 S 10th St, BLSB 1008 Philadelphia Pennsylvania 504 | ## GSM1550562 233 S 10th St, BLSB 1008 Philadelphia Pennsylvania 505 | ## GSM1550563 233 S 10th St, BLSB 1008 Philadelphia Pennsylvania 506 | ## GSM1550564 233 S 10th St, BLSB 1008 Philadelphia Pennsylvania 507 | ## contact_zip/postal_code contact_country 508 | ## GSM1550559 19107 USA 509 | ## GSM1550560 19107 USA 510 | ## GSM1550561 19107 USA 511 | ## GSM1550562 19107 USA 512 | ## GSM1550563 19107 USA 513 | ## GSM1550564 19107 USA 514 | ## supplementary_file 515 | ## GSM1550559 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550559/suppl/GSM1550559_01_LN_CDT-CTRL_1.CEL.gz 516 | ## GSM1550560 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550560/suppl/GSM1550560_01_LN_CTS-CTRL_2.CEL.gz 517 | ## 
GSM1550561 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550561/suppl/GSM1550561_02_LN-CDT_CBTX_2.CEL.gz 518 | ## GSM1550562 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550562/suppl/GSM1550562_02_LN_CDT-CBTX_1.CEL.gz 519 | ## GSM1550563 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550563/suppl/GSM1550563_03_LN-CDT-DCTX_1.CEL.gz 520 | ## GSM1550564 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550564/suppl/GSM1550564_03_LN-CDT-DCTX_2.CEL.gz 521 | ## data_row_count cell line:ch1 522 | ## GSM1550559 44629 LNCaP 523 | ## GSM1550560 44629 LNCaP 524 | ## GSM1550561 44629 LNCaP 525 | ## GSM1550562 44629 LNCaP 526 | ## GSM1550563 44629 LNCaP 527 | ## GSM1550564 44629 LNCaP 528 | ## treatment:ch1 529 | ## GSM1550559 CTRL treated in charcoal dextran treated serum 530 | ## GSM1550560 CTRL treated in charcoal dextran treated serum 531 | ## GSM1550561 16h cabazitaxel (1nM) treated in charcoal dextran treated serum 532 | ## GSM1550562 16h cabazitaxel (1nM) treated in charcoal dextran treated serum 533 | ## GSM1550563 16h docetaxel (1nM) treated in charcoal dextran treated serum 534 | ## GSM1550564 16h docetaxel (1nM) treated in charcoal dextran treated serum 535 | 536 | ``` r 537 | table(sampleInfo$characteristics_ch1.1) 538 | ``` 539 | 540 | ## 541 | ## treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum 542 | ## 2 543 | ## treatment: 16h cabazitaxel (1nM) treated in full serum 544 | ## 2 545 | ## treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum 546 | ## 2 547 | ## treatment: 16h docetaxel (1nM) treated in full serum 548 | ## 2 549 | ## treatment: CTRL treated in charcoal dextran treated serum 550 | ## 2 551 | ## treatment: CTRL treated in full serum 552 | ## 2 553 | 554 | ``` r 555 | #Let's pick just those columns that seem to contain factors we might 556 | #need for the analysis. 
557 | sampleInfo <- select(sampleInfo, characteristics_ch1.1) 558 | 559 | ## Optionally, rename to more convenient column names 560 | sampleInfo <- rename(sampleInfo, sample = characteristics_ch1.1) 561 | 562 | head(sampleInfo) 563 | ``` 564 | 565 | ## sample 566 | ## GSM1550559 treatment: CTRL treated in charcoal dextran treated serum 567 | ## GSM1550560 treatment: CTRL treated in charcoal dextran treated serum 568 | ## GSM1550561 treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum 569 | ## GSM1550562 treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum 570 | ## GSM1550563 treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum 571 | ## GSM1550564 treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum 572 | 573 | ``` r 574 | dim(sampleInfo) 575 | ``` 576 | 577 | ## [1] 12 1 578 | 579 | ``` r 580 | sampleInfo$sample 581 | ``` 582 | 583 | ## [1] "treatment: CTRL treated in charcoal dextran treated serum" 584 | ## [2] "treatment: CTRL treated in charcoal dextran treated serum" 585 | ## [3] "treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum" 586 | ## [4] "treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum" 587 | ## [5] "treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum" 588 | ## [6] "treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum" 589 | ## [7] "treatment: CTRL treated in full serum" 590 | ## [8] "treatment: CTRL treated in full serum" 591 | ## [9] "treatment: 16h cabazitaxel (1nM) treated in full serum" 592 | ## [10] "treatment: 16h cabazitaxel (1nM) treated in full serum" 593 | ## [11] "treatment: 16h docetaxel (1nM) treated in full serum" 594 | ## [12] "treatment: 16h docetaxel (1nM) treated in full serum" 595 | 596 | ``` r 597 | library(stringr) 598 | sampleInfo$group <- "" 599 | for(i in 1:nrow(sampleInfo)){ 600 | if(str_detect(sampleInfo$sample[i], "CTRL") && 
str_detect(sampleInfo$sample[i], "full")) 601 | {sampleInfo$group[i] <- "Conf"} 602 | 603 | if(str_detect(sampleInfo$sample[i], "CTRL") && str_detect(sampleInfo$sample[i], "dextran")) 604 | {sampleInfo$group[i] <- "Cond"} 605 | 606 | if(str_detect(sampleInfo$sample[i], "cabazitaxel") && str_detect(sampleInfo$sample[i], "full")) 607 | {sampleInfo$group[i] <- "cabazitaxelf"} 608 | 609 | if(str_detect(sampleInfo$sample[i], "cabazitaxel") && str_detect(sampleInfo$sample[i], "dextran")) 610 | {sampleInfo$group[i] <- "cabazitaxeld"} 611 | 612 | if(str_detect(sampleInfo$sample[i], "docetaxel") && str_detect(sampleInfo$sample[i], "full")) 613 | {sampleInfo$group[i] <- "docetaxelf"} 614 | 615 | if(str_detect(sampleInfo$sample[i], "docetaxel") && str_detect(sampleInfo$sample[i], "dextran")) 616 | {sampleInfo$group[i] <- "docetaxeld"} 617 | } 618 | 619 | sampleInfo 620 | ``` 621 | 622 | ## sample 623 | ## GSM1550559 treatment: CTRL treated in charcoal dextran treated serum 624 | ## GSM1550560 treatment: CTRL treated in charcoal dextran treated serum 625 | ## GSM1550561 treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum 626 | ## GSM1550562 treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum 627 | ## GSM1550563 treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum 628 | ## GSM1550564 treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum 629 | ## GSM1550565 treatment: CTRL treated in full serum 630 | ## GSM1550566 treatment: CTRL treated in full serum 631 | ## GSM1550567 treatment: 16h cabazitaxel (1nM) treated in full serum 632 | ## GSM1550568 treatment: 16h cabazitaxel (1nM) treated in full serum 633 | ## GSM1550569 treatment: 16h docetaxel (1nM) treated in full serum 634 | ## GSM1550570 treatment: 16h docetaxel (1nM) treated in full serum 635 | ## group 636 | ## GSM1550559 Cond 637 | ## GSM1550560 Cond 638 | ## GSM1550561 cabazitaxeld 639 | ## GSM1550562 cabazitaxeld 640 | ## GSM1550563 
docetaxeld 641 | ## GSM1550564 docetaxeld 642 | ## GSM1550565 Conf 643 | ## GSM1550566 Conf 644 | ## GSM1550567 cabazitaxelf 645 | ## GSM1550568 cabazitaxelf 646 | ## GSM1550569 docetaxelf 647 | ## GSM1550570 docetaxelf 648 | 649 | ``` r 650 | sampleInfo$serum <- "" 651 | for(i in 1:nrow(sampleInfo)){ 652 | if(str_detect(sampleInfo$sample[i], "dextran")) 653 | {sampleInfo$serum[i] <- "dextran"} 654 | 655 | if(str_detect(sampleInfo$sample[i], "full")) 656 | {sampleInfo$serum[i] <- "full_serum"} 657 | 658 | } 659 | 660 | sampleInfo <- sampleInfo[,-1] 661 | sampleInfo 662 | ``` 663 | 664 | ## group serum 665 | ## GSM1550559 Cond dextran 666 | ## GSM1550560 Cond dextran 667 | ## GSM1550561 cabazitaxeld dextran 668 | ## GSM1550562 cabazitaxeld dextran 669 | ## GSM1550563 docetaxeld dextran 670 | ## GSM1550564 docetaxeld dextran 671 | ## GSM1550565 Conf full_serum 672 | ## GSM1550566 Conf full_serum 673 | ## GSM1550567 cabazitaxelf full_serum 674 | ## GSM1550568 cabazitaxelf full_serum 675 | ## GSM1550569 docetaxelf full_serum 676 | ## GSM1550570 docetaxelf full_serum 677 | 678 | # Sample clustering and Principal Components Analysis 679 | 680 | We can visualize the correlations between the samples by hierarchical 681 | clustering. 682 | 683 | The function ‘cor’ calculates the correlation (Pearson by default, on a scale of -1 to 1) 684 | in a pairwise fashion between all samples, which we then visualise on a heatmap. 685 | There are many ways to create heatmaps in R; here I use ‘pheatmap’, and the 686 | only argument it requires is a matrix of numeric values. 687 | 688 | We can add more sample information to the plot to get a better picture of the 689 | groups and clustering. Here, we make use of the ‘sampleInfo’ data frame that 690 | was created earlier, whose row names must match the columns of the correlation 691 | matrix.
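
To make the correlation and annotation-matching steps concrete, here is a small sketch in base R. The matrix `expr`, the sample names `S1` to `S4`, and the annotation table `info` are all made-up placeholders, not values from GSE63477. It shows how `use = "complete.obs"` handles a missing value, and one cautious way to align an annotation table to the correlation matrix by name.

```r
set.seed(1)

# Toy expression matrix: 100 probes x 4 samples (hypothetical sample names)
expr <- matrix(rnorm(400, mean = 5), nrow = 100,
               dimnames = list(NULL, c("S1", "S2", "S3", "S4")))
expr[1, 1] <- NA  # simulate one missing data point

# Pairwise Pearson correlation between samples (columns);
# "complete.obs" drops any probe that has a missing value
cor_mat <- cor(expr, use = "complete.obs")

# Annotation table whose rows are deliberately in a different order
info <- data.frame(serum = c("full", "dextran", "full", "dextran"),
                   row.names = c("S3", "S2", "S1", "S4"))

# Reorder by sample name instead of overwriting the row names
info <- info[colnames(cor_mat), , drop = FALSE]

# The annotation is now aligned with the correlation matrix
stopifnot(identical(rownames(info), colnames(cor_mat)))
```

Reordering by name is a design choice: simply assigning `colnames(cor_mat)` to the row names would make the check pass even when the rows are in the wrong order, which would silently mislabel samples on the heatmap.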
692 | 693 | ``` r 694 | library(pheatmap) 695 | ## use="complete.obs" drops probes with missing values, which stops an error if there are any missing data points 696 | 697 | corMatrix <- cor(exprs(gse),use="complete.obs") 698 | pheatmap(corMatrix) 699 | ``` 700 | 701 | ![](GSE_analysis_microarray_files/figure-gfm/unnamed-chunk-4-1.png) 702 | 703 | ``` r 704 | ## Print the rownames of the sample information and check it matches the correlation matrix 705 | 706 | rownames(sampleInfo) 707 | ``` 708 | 709 | ## [1] "GSM1550559" "GSM1550560" "GSM1550561" "GSM1550562" "GSM1550563" 710 | ## [6] "GSM1550564" "GSM1550565" "GSM1550566" "GSM1550567" "GSM1550568" 711 | ## [11] "GSM1550569" "GSM1550570" 712 | 713 | ``` r 714 | colnames(corMatrix) 715 | ``` 716 | 717 | ## [1] "GSM1550559" "GSM1550560" "GSM1550561" "GSM1550562" "GSM1550563" 718 | ## [6] "GSM1550564" "GSM1550565" "GSM1550566" "GSM1550567" "GSM1550568" 719 | ## [11] "GSM1550569" "GSM1550570" 720 | 721 | ``` r 722 | ## If not, force the rownames to match the columns 723 | #rownames(sampleInfo) <- colnames(corMatrix) 724 | 725 | pheatmap(corMatrix, annotation_col= sampleInfo) 726 | ``` 727 | 728 | ![](GSE_analysis_microarray_files/figure-gfm/unnamed-chunk-4-2.png) 729 | 730 | Another way is to use principal component analysis (PCA). Note 731 | that the expression matrix has to be transposed, so that the genes are in the 732 | columns and the samples are in the rows. Otherwise, PCA would be computed across genes instead of samples 733 | and could run out of memory. 734 | 735 | Let’s add labels to plot the results. Here, we use the ‘ggplot2’ 736 | package, while the ‘ggrepel’ package is used to position the text labels 737 | more cleverly so they can be read. Here we can see that the samples are 738 | divided into two groups based on the serum treatment types.
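
The effect of the transpose can be checked on a toy example. The sketch below uses base R only and a simulated matrix (`expr`, with hypothetical probe and sample names) in which the last three samples are shifted on a subset of probes, so that two sample groups exist by construction. It is an illustration under those assumptions, not an analysis of the real dataset.

```r
set.seed(42)

# Toy matrix: 200 probes (rows) x 6 samples (columns); all names are made up
expr <- matrix(rnorm(1200), nrow = 200,
               dimnames = list(paste0("probe", 1:200), paste0("S", 1:6)))

# Shift samples S4-S6 on the first 50 probes to create two sample groups
expr[1:50, 4:6] <- expr[1:50, 4:6] + 3

# prcomp treats rows as observations, so samples must be in the rows
pca <- prcomp(t(expr))

# One row of scores per sample
dim(pca$x)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 2)

# PC1 scores per sample: the two simulated groups fall on opposite sides of zero
pca$x[, "PC1"]
```

Without the transpose, `prcomp` would treat each of the probes as an observation and each sample as a variable, which answers a different question and scales poorly with the tens of thousands of probes on a real array.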
739 | 740 | ``` r 741 | #make PCA 742 | library(ggplot2) 743 | library(ggrepel) 744 | ## MAKE SURE TO TRANSPOSE THE EXPRESSION MATRIX 745 | 746 | pca <- prcomp(t(exprs(gse))) 747 | 748 | ## Join the PCs to the sample information 749 | cbind(sampleInfo, pca$x) %>% 750 | ggplot(aes(x = PC1, y=PC2, col=group, label=paste("",group))) + geom_point() + geom_text_repel() 751 | ``` 752 | 753 | ![](GSE_analysis_microarray_files/figure-gfm/unnamed-chunk-5-1.png) 754 | 755 | # Differential expression analysis 756 | 757 | In this section, we use the limma package to perform differential 758 | expression analysis. Limma stands for “Linear Models for Microarray Data”. Here, we 759 | need to tell limma which sample groups we want to compare. I choose 760 | sampleInfo$group. A design matrix will be created; this is a matrix of 0 761 | and 1, with one row for each sample and one column for each sample group. 762 | 763 | We can rename the columns so that they are easier to read. 764 | 765 | Now, let’s check if the expression data contain any lowly-expressed 766 | genes, as these will affect the quality of the DE analysis. A big problem in 767 | statistical analyses such as limma is the inflation of type 1 768 | statistical errors, also called false positives, when many tests are performed. One simple way to reduce 769 | the number of type 1 errors is to do fewer comparisons, by 770 | filtering the data. For example, we know that not all genes are 771 | expressed in all tissues and many genes will not be expressed in any 772 | sample. As a result, in DGE analysis, it makes sense to remove the genes 773 | that are likely not expressed at all. 774 | 775 | How one defines a gene as being expressed is quite subjective. Here, I 776 | follow the tutorial and set the cut-off at the median of the expression 777 | values, which means that around 50% of the measurements are considered not 778 | expressed. A gene is kept if it is expressed in more than 3 779 | samples (the threshold used in the code below).
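
The median-based filter described above can be traced by hand on a tiny example. The sketch below is base R with made-up log2 values (`expr`) and a hypothetical threshold `min_samples`; in practice the threshold is often chosen in relation to the smallest group size, which is an assumption to adjust for your own design.

```r
# Toy log2 expression matrix: 6 probes x 4 samples (all values are made up)
expr <- matrix(c(4.1, 4.2, 4.0, 4.3,
                 9.0, 9.1, 8.8, 9.2,
                 4.2, 8.5, 4.1, 4.0,
                 7.5, 7.7, 7.4, 7.6,
                 4.0, 4.1, 4.2, 4.1,
                 6.9, 7.0, 4.2, 7.1),
               nrow = 6, byrow = TRUE,
               dimnames = list(paste0("probe", 1:6), paste0("S", 1:4)))

# Cut-off: median over all measurements in the matrix
cutoff <- median(expr)

# TRUE where a probe is above the cut-off in a given sample
is_expressed <- expr > cutoff

# Hypothetical threshold: keep probes expressed in more than this many samples
min_samples <- 2
keep <- rowSums(is_expressed) > min_samples

table(keep)

# Subset to the retained probes
filtered <- expr[keep, , drop = FALSE]
rownames(filtered)
```

Here `probe3` is above the cut-off in a single sample only and is dropped, while `probe6` passes in three of four samples and is kept. Note that the filter never looks at the group labels, which is what keeps it from biasing the later tests.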
780 | 781 | We can see that around half of the genes do not qualify as 782 | “expressed” here, which makes sense, because our cut-off is the 783 | median value. 784 | 785 | ``` r 786 | library(limma) 787 | ``` 788 | 789 | ## 790 | ## Attaching package: 'limma' 791 | 792 | ## The following object is masked from 'package:BiocGenerics': 793 | ## 794 | ## plotMA 795 | 796 | ``` r 797 | design <- model.matrix(~0 + sampleInfo$group) 798 | design 799 | ``` 800 | 801 | ## sampleInfo$groupcabazitaxeld sampleInfo$groupcabazitaxelf 802 | ## 1 0 0 803 | ## 2 0 0 804 | ## 3 1 0 805 | ## 4 1 0 806 | ## 5 0 0 807 | ## 6 0 0 808 | ## 7 0 0 809 | ## 8 0 0 810 | ## 9 0 1 811 | ## 10 0 1 812 | ## 11 0 0 813 | ## 12 0 0 814 | ## sampleInfo$groupCond sampleInfo$groupConf sampleInfo$groupdocetaxeld 815 | ## 1 1 0 0 816 | ## 2 1 0 0 817 | ## 3 0 0 0 818 | ## 4 0 0 0 819 | ## 5 0 0 1 820 | ## 6 0 0 1 821 | ## 7 0 1 0 822 | ## 8 0 1 0 823 | ## 9 0 0 0 824 | ## 10 0 0 0 825 | ## 11 0 0 0 826 | ## 12 0 0 0 827 | ## sampleInfo$groupdocetaxelf 828 | ## 1 0 829 | ## 2 0 830 | ## 3 0 831 | ## 4 0 832 | ## 5 0 833 | ## 6 0 834 | ## 7 0 835 | ## 8 0 836 | ## 9 0 837 | ## 10 0 838 | ## 11 1 839 | ## 12 1 840 | ## attr(,"assign") 841 | ## [1] 1 1 1 1 1 1 842 | ## attr(,"contrasts") 843 | ## attr(,"contrasts")$`sampleInfo$group` 844 | ## [1] "contr.treatment" 845 | 846 | ``` r 847 | ## the column names are a bit ugly, so we will rename 848 | colnames(design) <- c("Cabazitaxeld","Cabazitaxelf","Cond","Conf","Docetaxeld","Docetaxelf") 849 | 850 | design 851 | ``` 852 | 853 | ## Cabazitaxeld Cabazitaxelf Cond Conf Docetaxeld Docetaxelf 854 | ## 1 0 0 1 0 0 0 855 | ## 2 0 0 1 0 0 0 856 | ## 3 1 0 0 0 0 0 857 | ## 4 1 0 0 0 0 0 858 | ## 5 0 0 0 0 1 0 859 | ## 6 0 0 0 0 1 0 860 | ## 7 0 0 0 1 0 0 861 | ## 8 0 0 0 1 0 0 862 | ## 9 0 1 0 0 0 0 863 | ## 10 0 1 0 0 0 0 864 | ## 11 0 0 0 0 0 1 865 | ## 12 0 0 0 0 0 1 866 | ## attr(,"assign") 867 | ## [1] 1 1 1 1 1 1 868 | ## attr(,"contrasts") 869 | ##
attr(,"contrasts")$`sampleInfo$group` 870 | ## [1] "contr.treatment" 871 | 872 | ``` r 873 | ## calculate median expression level 874 | cutoff <- median(exprs(gse)) 875 | 876 | ## TRUE or FALSE for whether each gene is "expressed" in each sample 877 | is_expressed <- exprs(gse) > cutoff 878 | 879 | ## Identify genes expressed in more than 3 samples 880 | 881 | keep <- rowSums(is_expressed) > 3 882 | 883 | ## check how many genes are removed / retained. 884 | table(keep) 885 | ``` 886 | 887 | ## keep 888 | ## FALSE TRUE 889 | ## 20965 23664 890 | 891 | ``` r 892 | ## subset to just those expressed genes 893 | gse <- gse[keep,] 894 | ``` 895 | 896 | Here there is a small extra step to deal with outliers. This has to 897 | be done carefully so that the filtered data are not biased. We calculate 898 | ‘weights’ that describe the reliability of each sample. The ‘arrayWeights’ 899 | function assigns a score to each sample, with a value of 1 implying 900 | equal weight. Samples with a score below 1 are down-weighted, and samples with a score above 1 are 901 | up-weighted. 902 | 903 | ``` r 904 | # coping with outliers 905 | ## calculate relative array weights 906 | aw <- arrayWeights(exprs(gse),design) 907 | aw 908 | ``` 909 | 910 | ## 1 2 3 4 5 6 7 8 911 | ## 0.9704842 0.9704842 0.8790788 0.8790788 0.9799805 0.9799805 0.8296341 0.8296341 912 | ## 9 10 11 12 913 | ## 1.1632620 1.1632620 1.2393733 1.2393733 914 | 915 | Now that we have a design matrix, we need to estimate the coefficients. For 916 | this design, we essentially average the replicate arrays for each 917 | sample group. In addition, we calculate standard deviations for 918 | each gene, and the average intensity of each gene across all 919 | microarrays. 920 | 921 | We are ready to tell limma which pairwise contrasts we want to 922 | make. For this experiment, we are going to contrast each treatment (there are 923 | two taxane drugs) with the control in each serum type, so there are 924 | 4 contrasts to specify.
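
To see what a contrast computes under a group-means design, here is a simplified sketch in base R. It uses one made-up gene (`y`), four of the six groups, and a hand-written contrast matrix; limma's `makeContrasts` builds a matrix of this kind from expressions such as `Docetaxeld - Cond`. The sketch ignores array weights and the empirical Bayes step, so it only illustrates the arithmetic behind the log fold changes.

```r
# Four hypothetical groups with two replicates each
group <- factor(c("Cond", "Cond", "Docetaxeld", "Docetaxeld",
                  "Conf", "Conf", "Docetaxelf", "Docetaxelf"))

# Design without intercept: one 0/1 column per group
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Made-up log2 expression values for a single gene, in the same sample order
y <- c(5.0, 5.2, 6.1, 5.9, 4.8, 5.0, 4.9, 5.1)

# With this design, the least-squares coefficients are the group means
group_means <- lm.fit(design, y)$coefficients

# Contrast matrix: one column per comparison, rows in the order of colnames(design)
contrast_mat <- cbind(
  DocD_vs_ConD = c(Cond = -1, Conf = 0, Docetaxeld = 1, Docetaxelf = 0),
  DocF_vs_ConF = c(Cond = 0, Conf = -1, Docetaxeld = 0, Docetaxelf = 1)
)
stopifnot(identical(rownames(contrast_mat), names(group_means)))

# Each contrast is a weighted sum of group means, i.e. a log2 fold change
logFC <- drop(t(contrast_mat) %*% group_means)
logFC
```

Each column of the contrast matrix has a `1` for the treated group and a `-1` for its matching control, so the result is simply the difference of the two group means on the log2 scale.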
925 | 926 | To do the statistical comparisons, limma uses an empirical Bayes method that 927 | moderates the gene-wise variance estimates by borrowing information across genes, which makes the tests more stable when there are few replicates. The ‘eBayes’ function performs this step. To 928 | summarize the results of the statistical tests, ‘topTable’ adjusts 929 | the p-values and returns the top genes that meet the cutoffs that you 930 | supply as arguments, while ‘decideTests’ makes calls for DEGs by 931 | adjusting the p-values and applying a logFC cutoff, similar to topTable. 932 | 933 | ``` r 934 | ## Fitting the coefficients 935 | fit <- lmFit(exprs(gse), design, 936 | weights = aw) 937 | 938 | head(fit$coefficients) 939 | ``` 940 | 941 | ## Cabazitaxeld Cabazitaxelf Cond Conf Docetaxeld Docetaxelf 942 | ## 16657436 4.604278 4.590902 4.539703 4.606668 4.569910 4.504447 943 | ## 16657440 5.073400 5.194166 5.274991 5.001295 5.230574 5.016408 944 | ## 16657450 6.606782 6.532054 6.571707 6.754964 6.854709 6.495389 945 | ## 16657469 5.318536 5.298899 5.358629 5.191262 5.398528 5.356415 946 | ## 16657476 5.410581 5.443410 5.358263 5.343908 5.516173 5.335881 947 | ## 16657480 4.481099 4.454008 4.487220 4.453781 4.418727 4.379055 948 | 949 | ``` r 950 | ## Making comparisons between samples, can define multiple contrasts 951 | contrasts <- makeContrasts(Docetaxeld - Cond, Cabazitaxeld - Cond, Docetaxelf - Conf, Cabazitaxelf - Conf, levels = design) 952 | 953 | fit2 <- contrasts.fit(fit, contrasts) 954 | fit2 <- eBayes(fit2) 955 | 956 | 957 | topTable(fit2) 958 | ``` 959 | 960 | ## Docetaxeld...Cond Cabazitaxeld...Cond Docetaxelf...Conf 961 | ## 16681891 -0.427866483 0.55288766 -0.02322651 962 | ## 16840609 0.573090433 -0.20083754 -0.21477778 963 | ## 17017165 0.116986540 -0.32582451 -0.44959204 964 | ## 16782010 0.178206104 -0.01891042 -0.29766696 965 | ## 17099705 -0.437302094 0.02759174 -0.16952565 966 | ## 16691877 -0.149430506 -0.18664917 -0.90866405 967 | ## 16959582 -0.006190599 -0.17371327 0.40620159 968 | ## 16970902 -0.416408859 0.01679171 -0.02017417 969 | ## 16936214 -0.112315413
0.12707501 -0.28387277 970 | ## 17009126 -0.645309985 -0.59167505 -0.39348344 971 | ## Cabazitaxelf...Conf AveExpr F P.Value adj.P.Val 972 | ## 16681891 -0.1414628 5.625598 43.01843 2.992480e-06 0.07081404 973 | ## 16840609 -0.3278230 5.769203 18.87958 1.217387e-04 0.99909331 974 | ## 17017165 -0.3710638 5.244400 14.18263 4.059196e-04 0.99909331 975 | ## 16782010 0.2850105 4.440952 14.12466 4.128114e-04 0.99909331 976 | ## 17099705 0.2776297 6.750407 13.62541 4.783683e-04 0.99909331 977 | ## 16691877 -0.5849898 4.667038 11.52373 9.376105e-04 0.99909331 978 | ## 16959582 0.3098904 5.531461 11.26699 1.024647e-03 0.99909331 979 | ## 16970902 -0.1618155 5.237839 11.18970 1.052724e-03 0.99909331 980 | ## 16936214 -0.6271607 5.482977 11.17492 1.058196e-03 0.99909331 981 | ## 17009126 -0.3389764 4.940635 11.00507 1.123616e-03 0.99909331 982 | 983 | ``` r 984 | topTable1 <- topTable(fit2, coef=1) 985 | topTable2 <- topTable(fit2, coef=2) 986 | topTable3 <- topTable(fit2, coef=3) 987 | topTable4 <- topTable(fit2, coef=4) 988 | 989 | #if we want to know how many genes are differentially expressed overall, we can use the decideTests function. 990 | summary(decideTests(fit2)) 991 | ``` 992 | 993 | ## Docetaxeld - Cond Cabazitaxeld - Cond Docetaxelf - Conf 994 | ## Down 0 0 0 995 | ## NotSig 23664 23664 23664 996 | ## Up 0 0 0 997 | ## Cabazitaxelf - Conf 998 | ## Down 0 999 | ## NotSig 23664 1000 | ## Up 0 1001 | 1002 | ``` r 1003 | table(decideTests(fit2)) 1004 | ``` 1005 | 1006 | ## 1007 | ## 0 1008 | ## 94656 1009 | 1010 | # Further visualization with gene annotation 1011 | 1012 | Now we want to know the gene name associated with each gene ID. The 1013 | annotation data can be retrieved with the ‘fData’ function. Let’s select 1014 | the ID and GB\_ACC columns (GB\_ACC is the GenBank accession ID) and add them to the fit2 object. 1015 | 1016 | The volcano plot is a common way of visualising the results 1017 | of a DE analysis.
The x axis shows the log fold change and the y axis is some measure of
statistical significance, which in this case is the log-odds, or “B”
statistic. We can also colour the genes that pass both a p-value cutoff
(below 0.05) and an absolute log fold change cutoff (above 1).

``` r
anno <- fData(gse)
head(anno)
```

    ##                ID RANGE_STRAND RANGE_START RANGE_END total_probes    GB_ACC
    ## 16657436 16657436            +       12190     13639           25 NR_046018
    ## 16657440 16657440            +       29554     31109           28          
    ## 16657450 16657450            +      317811    328581           36 NR_024368
    ## 16657469 16657469            +      329790    342507           27          
    ## 16657476 16657476            +      459656    461954           27 NR_029406
    ## 16657480 16657480            +      523009    532878           12          
    ##                     SPOT_ID     RANGE_GB
    ## 16657436   chr1:12190-13639 NC_000001.10
    ## 16657440   chr1:29554-31109 NC_000001.10
    ## 16657450 chr1:317811-328581 NC_000001.10
    ## 16657469 chr1:329790-342507 NC_000001.10
    ## 16657476 chr1:459656-461954 NC_000001.10
    ## 16657480 chr1:523009-532878 NC_000001.10

``` r
## Keep only the ID and GenBank accession columns, then attach them to the fit
anno <- select(anno,ID,GB_ACC)
fit2$genes <- anno

topTable(fit2)
```

    ##                ID       GB_ACC Docetaxeld...Cond Cabazitaxeld...Cond
    ## 16681891 16681891 NM_001013692      -0.427866483          0.55288766
    ## 16840609 16840609                    0.573090433         -0.20083754
    ## 17017165 17017165                    0.116986540         -0.32582451
    ## 16782010 16782010                    0.178206104         -0.01891042
    ## 17099705 17099705    NR_039696      -0.437302094          0.02759174
    ## 16691877 16691877                   -0.149430506         -0.18664917
    ## 16959582 16959582                   -0.006190599         -0.17371327
    ## 16970902 16970902                   -0.416408859          0.01679171
    ## 16936214 16936214    NR_037440      -0.112315413          0.12707501
    ## 17009126 17009126                   -0.645309985         -0.59167505
    ##          Docetaxelf...Conf Cabazitaxelf...Conf  AveExpr        F      P.Value
    ## 16681891       -0.02322651          -0.1414628 5.625598 43.01843 2.992480e-06
    ## 16840609       -0.21477778          -0.3278230 5.769203 18.87958 1.217387e-04
    ## 17017165       -0.44959204          -0.3710638 5.244400 14.18263 4.059196e-04
    ## 16782010       -0.29766696           0.2850105 4.440952 14.12466 4.128114e-04
    ## 17099705       -0.16952565           0.2776297 6.750407 13.62541 4.783683e-04
    ## 16691877       -0.90866405          -0.5849898 4.667038 11.52373 9.376105e-04
    ## 16959582        0.40620159           0.3098904 5.531461 11.26699 1.024647e-03
    ## 16970902       -0.02017417          -0.1618155 5.237839 11.18970 1.052724e-03
    ## 16936214       -0.28387277          -0.6271607 5.482977 11.17492 1.058196e-03
    ## 17009126       -0.39348344          -0.3389764 4.940635 11.00507 1.123616e-03
    ##           adj.P.Val
    ## 16681891 0.07081404
    ## 16840609 0.99909331
    ## 17017165 0.99909331
    ## 16782010 0.99909331
    ## 17099705 0.99909331
    ## 16691877 0.99909331
    ## 16959582 0.99909331
    ## 16970902 0.99909331
    ## 16936214 0.99909331
    ## 17009126 0.99909331

``` r
## Create volcano plot
full_results1 <- topTable(fit2, coef=1, number=Inf)
library(ggplot2)
ggplot(full_results1,aes(x = logFC, y=B)) + geom_point()
```

![](GSE_analysis_microarray_files/figure-gfm/unnamed-chunk-9-1.png)

``` r
## change according to your needs
p_cutoff <- 0.05
fc_cutoff <- 1

## A gene is flagged only if it passes BOTH cutoffs (combined with &)
full_results1 %>% 
  mutate(Significant = P.Value < p_cutoff & abs(logFC) > fc_cutoff) %>% 
  ggplot(aes(x = logFC, y = B, col=Significant)) + geom_point()
```

![](GSE_analysis_microarray_files/figure-gfm/unnamed-chunk-9-2.png)

# Further visualization of selected gene

At this point, the structure of GSE data should be quite clear. It has
the sample information, pData; the expression data, exprs; and also the
annotation data, fData.
We have also learned how to check the expression data, normalize it, and
perform differential expression analysis.

Now, with the differential expression gene tables, there are some
downstream analyses that we can continue with, such as exporting a full
table of DE genes, generating a heatmap for your selected genes, getting
the gene list for a particular pathway, or survival analysis (which is
only applicable to clinical data).

Here, I just want to look at the fold change of a selected gene in each
contrast, and whether it is significantly differentially expressed or
not.

``` r
## Get the results for particular gene of interest
## GB_ACC for Nkx3-1 is NM_001256339 or NM_006167
## NM_001256339 is not present in this data
full_results2 <- topTable(fit2, coef=2, number=Inf)
full_results3 <- topTable(fit2, coef=3, number=Inf)
full_results4 <- topTable(fit2, coef=4, number=Inf)
filter(full_results1, GB_ACC == "NM_006167")
```

    ##                ID    GB_ACC      logFC  AveExpr         t    P.Value adj.P.Val
    ## 17075536 17075536 NM_006167 -0.2379718 8.055042 -3.056052 0.01219317 0.9819604
    ##                  B
    ## 17075536 -3.374675

``` r
filter(full_results2, GB_ACC == "NM_006167")
```

    ##                ID    GB_ACC       logFC  AveExpr         t   P.Value adj.P.Val
    ## 17075536 17075536 NM_006167 -0.09964001 8.055042 -1.244539 0.2418169 0.9999406
    ##                  B
    ## 17075536 -4.561219

``` r
filter(full_results3, GB_ACC == "NM_006167")
```

    ##                ID    GB_ACC      logFC  AveExpr         t   P.Value adj.P.Val
    ## 17075536 17075536 NM_006167 0.02902121 8.055042 0.3762533 0.7146269 0.9999674
    ##                  B
    ## 17075536 -4.926393

``` r
filter(full_results4, GB_ACC == "NM_006167")
```

    ##                ID    GB_ACC      logFC  AveExpr         t   P.Value adj.P.Val
    ## 17075536 17075536 NM_006167 0.01019426 8.055042 0.1304658 0.8987979 0.9998834
    ##                 B
    ## 17075536 -4.95523

That’s all for the walk-through. Thanks for reading; I hope you have
learned something new here.

# Acknowledgement

Many thanks to the following tutorials made publicly available:

1.  Introduction to microarray analysis GSE15947, by Department of
    Statistics, Purdue University

2.  Mark Dunning, 2020, GEO tutorial, by Sheffield Bioinformatics Core

--------------------------------------------------------------------------------