├── GSE_analysis_microarray.Rmd
├── Info.txt
└── README.md


/GSE_analysis_microarray.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "GSE analysis for microarray data"
  3 | author: "Lian Chee Foong"
  4 | date: "2/3/2021"
  5 | output: 
  6 |   pdf_document:
  7 |     df_print: tibble
  8 |     toc: true
  9 |     number_sections: true
 10 |     highlight: tango
 11 |     fig_width: 3
 12 |     fig_height: 1.5
 13 | ---
 14 | 
 15 | 
 16 | ```{r setup, include=FALSE}
 17 | knitr::opts_chunk$set(echo = TRUE)
 18 | ```
 19 | 
 20 | ***
 21 | 
 22 | # Introduction
 23 | 
 24 | This R Markdown document demonstrates how to download GSE data from the NCBI GEO database and how to identify differentially expressed genes from it.
 25 | 
 26 | ## A little background of GEO
 27 | 
 28 | The Gene Expression Omnibus (GEO) is a data repository hosted by the National Center for Biotechnology Information (NCBI). NCBI also hosts GenBank, a collection of publicly available nucleotide sequences; its records come from direct submissions by the original authors, who volunteer them to make the data publicly available or do so as part of the publication process. GEO itself is intended to house different types of expression data, covering microarray and sequencing data in both raw and processed formats.
 29 | 
 30 | ## Example here
 31 | 
 32 | Here we analyse GSE63477, a microarray dataset. It contains expression profiles of prostate cancer cells (LNCaP) after treatment with cabazitaxel or docetaxel for 16 hr. You can read the details on NCBI GEO under the “Overall design” section; for a better understanding, it is always good to read the original paper as well. Link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi
 33 | 
 34 | ***
 35 | # Using GEOquery to obtain microarray data
 36 | 
 37 | First, set the working directory.
 38 | 
 39 | The GEOquery package allows you to access the data from GEO. Depending on your needs, you can download only the processed data and metadata provided by the depositor. In some cases, you may want to download the raw data as well, if it was provided by the depositor. 
 40 | 
 41 | The function to download a GEO dataset is ‘getGEO’ from the ‘GEOquery’ package. It returns a list with one element per platform, so check how many platforms were used for the GSE data; usually there is only one. We take the first element of the list, so gse is now an ExpressionSet. You can see that it contains assayData, phenoData, featureData, etc.
 42 | 
 43 | We can have a look at the sample information, gene annotation, and expression data. This gives us a rough idea of the information stored in this ExpressionSet.
 44 | 
 45 | ```{R}
 46 | getwd()
 47 | setwd("C:/Users/Lynn/Documents/R_GEOdata")
 48 | 
 49 | ###https://sbc.shef.ac.uk/geo_tutorial/tutorial.nb.html#Introduction
 50 | #----import the data------------------------------------
 51 | library(GEOquery)
 52 | my_id <- "GSE63477"
 53 | gse <- getGEO(my_id)
 54 | 
 55 | ## check how many platforms were used
 56 | length(gse)
 57 | gse <- gse[[1]]
 58 | gse
 59 | 
 60 | pData(gse) ## print the sample information
 61 | fData(gse) ## print the gene annotation
 62 | exprs(gse)[1,] ## print the expression data
 63 | ```
 64 | 
 65 | # Check the normalisation and scales used
 66 | 
 67 | We can use this command to check the normalization method and to see whether the data have already been processed. Here, the expression data were RMA normalized and filtered to remove low-expressing genes. RMA stands for Robust Multi-array Average, a widely used method for summarising probeset expression levels on Affymetrix arrays.
 68 | 
 69 | The ‘summary’ function can then be used to print the distribution of expression levels. If the data have been log2 transformed, the values typically fall in the range of 0 to 16. Here, however, the values are larger and go beyond 16. This is unexpected, because RMA should already log2 transform the data as its last step. For a more careful analysis, we could reprocess the raw data of this dataset by applying RMA normalization ourselves, to see if there is any difference.
 70 | 
 71 | For now, let’s perform a log2 transformation ourselves, check the summary of the expression levels again, and draw a boxplot. The distributions of the samples are highly similar, which suggests that the data have been normalised.
 72 | 
 73 | ```{R}
 74 | pData(gse)$data_processing[1]
 75 | # For visualisation and statistical analysis, we will inspect the data to 
 76 | # discover what scale the data are presented in. The methods we will use assume 
 77 | # the data are on a log2 scale; typically in the range of 0 to 16.
 78 | 
 79 | ## have a look at the expression values
 80 | summary(exprs(gse))
 81 | # From this output we clearly see that the values go beyond 16, 
 82 | # so we need to perform a log2 transformation.
 83 | exprs(gse) <- log2(exprs(gse))
 84 | 
 85 | # check again the summary
 86 | summary(exprs(gse))
 87 | 
 88 | boxplot(exprs(gse), outline = FALSE)
 89 | ```
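
Before transforming, it can help to check the scale programmatically rather than by eye. The following is a minimal sketch on made-up numbers, not part of the original analysis; the helper name `needs_log2` and its threshold of 16 are assumptions taken from the rule of thumb mentioned above.

```r
# Hypothetical helper: flag a matrix whose values exceed the usual log2 range.
# The threshold of 16 is the rule of thumb used in this walk-through, not a
# property of any particular platform.
needs_log2 <- function(x, upper = 16) {
  max(x, na.rm = TRUE) > upper
}

# Made-up expression matrix on a raw (unlogged) scale: 5 genes x 3 samples
raw <- matrix(c(24.6, 120.3, 900.1, 35.2, 15.8,
                22.0, 110.9, 870.4, 33.1, 14.9,
                24.4, 131.0, 910.7, 36.0, 16.2),
              nrow = 5,
              dimnames = list(paste0("gene", 1:5), paste0("sample", 1:3)))

before <- needs_log2(raw)      # the raw values go beyond 16
logged <- log2(raw)
after  <- needs_log2(logged)   # after log2 they fall inside the range

print(c(before = before, after = after))
```

A check like this only looks at the largest value, so it is a quick sanity test and not a substitute for reading the depositor's data processing notes.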
 90 | 
 91 | # Inspect the clinical variables
 92 | 
 93 | Next, we look through pData for the elements that we need for the analysis. We want to know, for each sample, whether it is a treatment or a control. In this dataset, that information is stored in the column 'characteristics_ch1.1'.
 94 | 
 95 | We can use the ‘select’ function from ‘dplyr’ to subset the column of interest. It is also useful to rename the column to something shorter and easier to type.
 96 | 
 97 | To make a column of simplified group names for each sample, the ‘stringr’ package is helpful. Two new columns are created, named “group” and “serum”. The function ‘str_detect’ detects the presence of given words in each row, and the new columns are filled accordingly. The categories you need depend entirely on your dataset, so modify these commands for your dataset of interest.
 98 | 
 99 | ```{R}
100 | 
101 | library(dplyr)
102 | sampleInfo <- pData(gse)
103 | head(sampleInfo)
104 | 
105 | table(sampleInfo$characteristics_ch1.1)
106 | 
107 | #Let's pick just those columns that seem to contain factors we might 
108 | #need for the analysis.
109 | sampleInfo <- select(sampleInfo, characteristics_ch1.1)
110 | 
111 | ## Optionally, rename to more convenient column names
112 | sampleInfo <- rename(sampleInfo, sample = characteristics_ch1.1)
113 | 
114 | head(sampleInfo)
115 | dim(sampleInfo)
116 | sampleInfo$sample
117 | 
118 | library(stringr)
119 | sampleInfo$group <- ""
120 | for(i in 1:nrow(sampleInfo)){
121 |   if(str_detect(sampleInfo$sample[i], "CTRL") && str_detect(sampleInfo$sample[i], "full"))
122 |   {sampleInfo$group[i] <- "Conf"}
123 |   
124 |   if(str_detect(sampleInfo$sample[i], "CTRL") && str_detect(sampleInfo$sample[i], "dextran"))
125 |   {sampleInfo$group[i] <- "Cond"}
126 |   
127 |   if(str_detect(sampleInfo$sample[i], "cabazitaxel") && str_detect(sampleInfo$sample[i], "full"))
128 |   {sampleInfo$group[i] <- "cabazitaxelf"}
129 |   
130 |   if(str_detect(sampleInfo$sample[i], "cabazitaxel") && str_detect(sampleInfo$sample[i], "dextran"))
131 |   {sampleInfo$group[i] <- "cabazitaxeld"}
132 |   
133 |   if(str_detect(sampleInfo$sample[i], "docetaxel") && str_detect(sampleInfo$sample[i], "full"))
134 |   {sampleInfo$group[i] <- "docetaxelf"}
135 |   
136 |   if(str_detect(sampleInfo$sample[i], "docetaxel") && str_detect(sampleInfo$sample[i], "dextran"))
137 |   {sampleInfo$group[i] <- "docetaxeld"}
138 | }
139 | 
140 | sampleInfo 
141 | 
142 | sampleInfo$serum <- ""
143 | for(i in 1:nrow(sampleInfo)){
144 |   if(str_detect(sampleInfo$sample[i], "dextran"))
145 |   {sampleInfo$serum[i] <- "dextran"}
146 |   
147 |   if(str_detect(sampleInfo$sample[i], "full"))
148 |   {sampleInfo$serum[i] <- "full_serum"}
149 |  
150 | }
151 | 
152 | sampleInfo <- sampleInfo[,-1]
153 | sampleInfo
154 | ```
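
The same labelling can also be written without loops. This is a sketch in base R on made-up treatment descriptions: only the first string is copied from the pData output of this dataset, the other three are assumed wording, and the sketch assumes that every description mentions either "dextran" or "full".

```r
# Made-up treatment descriptions; only the first string is taken from the
# pData output of this dataset, the others are assumed wording.
treatment <- c("treatment: CTRL treated in charcoal dextran treated serum",
               "treatment: CTRL treated in full serum",
               "treatment: cabazitaxel treated in charcoal dextran treated serum",
               "treatment: docetaxel treated in full serum")

# Drug part of the label; anything that is neither drug is treated as control
drug <- ifelse(grepl("cabazitaxel", treatment), "cabazitaxel",
        ifelse(grepl("docetaxel", treatment), "docetaxel", "Con"))

# Serum part: assumes each description mentions either "dextran" or "full"
is_dextran <- grepl("dextran", treatment)

sampleInfo <- data.frame(
  group = paste0(drug, ifelse(is_dextran, "d", "f")),
  serum = ifelse(is_dextran, "dextran", "full_serum"),
  stringsAsFactors = FALSE
)
print(sampleInfo)
```

The group names produced here ("Cond", "Conf", "cabazitaxeld", and so on) follow the naming used in the loop above, so the later steps would not need to change.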
155 | 
156 | # Sample clustering and Principal Components Analysis
157 | 
158 | We can visualize the correlations between the samples by hierarchical clustering.
159 | 
160 | The function ‘cor’ calculates the pairwise correlation between all samples, on a scale of -1 to 1, which can then be visualised as a heatmap. There are many ways to create heatmaps in R; here I use ‘pheatmap’, whose only required argument is a matrix of numeric values.
161 | 
162 | We can add more sample information to the plot to get a better picture of the groups and the clustering. Here, we make use of the 'sampleInfo' data frame that was created earlier; its row names must match the columns of the correlation matrix.
163 | 
164 | ```{R}
165 | 
166 | library(pheatmap)
167 | ## use = "complete.obs" leaves out genes with missing values, which would otherwise give NA correlations
168 | 
169 | corMatrix <- cor(exprs(gse), use = "complete.obs")
170 | pheatmap(corMatrix)   
171 | 
172 | ## Print the rownames of the sample information and check it matches the correlation matrix
173 | 
174 | rownames(sampleInfo)
175 | colnames(corMatrix)
176 | 
177 | ## If not, force the rownames to match the columns
178 | #rownames(sampleInfo) <- colnames(corMatrix)
179 | 
180 | pheatmap(corMatrix, annotation_col= sampleInfo)
181 | ```
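
The check that the annotation rows line up with the correlation matrix can be automated. The sketch below uses a simulated matrix with made-up sample names and group labels, so none of the values relate to GSE63477; it only illustrates the effect of `use = "complete.obs"` and the name check.

```r
set.seed(42)
# Simulated log2 expression matrix: 10 genes x 6 samples (made-up names)
expr <- matrix(rnorm(60, mean = 8, sd = 2), nrow = 10,
               dimnames = list(paste0("gene", 1:10), paste0("S", 1:6)))
expr[3, 2] <- NA  # one missing value, to show the effect of `use`

# Genes with missing values are dropped before the correlations are computed
corMatrix <- cor(expr, use = "complete.obs")

# Annotation data frame with one row per sample
sampleInfo <- data.frame(group = rep(c("control", "treated"), each = 3),
                         row.names = colnames(expr))

# Stop early if the annotation rows do not line up with the matrix columns
stopifnot(identical(rownames(sampleInfo), colnames(corMatrix)))

print(round(corMatrix, 2))
```

Failing early with `stopifnot` is safer than overwriting the row names, because a silent mismatch would attach the wrong labels to the heatmap.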
182 | 
183 | Another way is to use principal component analysis (PCA). Note that the expression matrix has to be transposed so that the samples are in the rows and the genes are in the columns. Otherwise, the genes would be treated as the observations, and the PCA could run out of memory.
184 | 
185 | Let’s add labels when plotting the results. Here we use the ‘ggplot2’ package, while the ‘ggrepel’ package positions the text labels so that they do not overlap and can be read. We can see that the samples are divided into two groups based on the serum type.
186 | 
187 | ```{R}
188 | #make PCA
189 | library(ggplot2)
190 | library(ggrepel)
191 | ## MAKE SURE TO TRANSPOSE THE EXPRESSION MATRIX
192 | 
193 | pca <- prcomp(t(exprs(gse)))
194 | 
195 | ## Join the PCs to the sample information
196 | cbind(sampleInfo, pca$x) %>% 
197 |   ggplot(aes(x = PC1, y=PC2, col=group, label=paste("",group))) + geom_point() + geom_text_repel()
198 | ```
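
To judge how much of the structure the first two components capture, the proportion of variance explained can be computed from the `sdev` element that `prcomp` returns. This is a minimal sketch on simulated data with made-up sample names, so the numbers themselves carry no biological meaning.

```r
set.seed(1)
# Simulated log2 expression matrix: 100 genes x 6 samples
expr <- matrix(rnorm(600, mean = 8, sd = 1), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("S", 1:6)))

# Transpose so that samples are rows (observations) and genes are columns
pca <- prcomp(t(expr))

# Proportion of the total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

print(dim(pca$x))               # one row of PC coordinates per sample
print(round(var_explained, 3))
```

These proportions are often added to the axis labels of the PCA plot so that readers can see how much weight to give to PC1 and PC2.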
199 | 
200 | # Differential expression analysis
201 | 
202 | In this section, we use the limma package to perform differential expression analysis. The name limma stands for “Linear Models for Microarray Data”. We need to tell limma which sample groups we want to compare; here I choose sampleInfo$group. A design matrix is then created: a matrix of 0s and 1s, with one row for each sample and one column for each sample group.
203 | 
204 | We can rename the columns so that they are easier to read.
205 | 
206 | Next, let’s check whether the expression data contain any lowly expressed genes, as these affect the quality of the DE analysis. A major concern in a statistical analysis like this one is type 1 errors, also called false positives. One simple way to reduce the number of type 1 errors is to do fewer comparisons, by filtering the data. For example, we know that not all genes are expressed in all tissues, and many genes will not be expressed in any sample. It therefore makes sense to remove the genes that are likely not expressed at all.
207 | 
208 | How one defines a gene as being expressed is quite subjective. Here, I follow the tutorial and set the cut-off at the median of the expression values, which means that around 50% of the measurements are considered not expressed. A gene is kept if it is expressed in more than 3 samples.
209 | 
210 | We can see that around half of the genes do not qualify as “expressed” here, which makes sense because our cut-off is the median value.
211 | 
212 | ```{R}
213 | library(limma)
214 | design <- model.matrix(~0 + sampleInfo$group)
215 | design
216 | 
217 | ## the column names are a bit ugly, so we will rename
218 | colnames(design) <- c("Cabazitaxeld","Cabazitaxelf","Cond","Conf","Docetaxeld","Docetaxelf")
219 | 
220 | design
221 | 
222 | ## calculate median expression level
223 | cutoff <- median(exprs(gse))
224 | 
225 | ## TRUE or FALSE for whether each gene is "expressed" in each sample
226 | is_expressed <- exprs(gse) > cutoff
227 | 
228 | ## Identify genes expressed in more than 3 samples
229 | 
230 | keep <- rowSums(is_expressed) > 3
231 | 
232 | ## check how many genes are removed / retained.
233 | table(keep)
234 | 
235 | ## subset to just those expressed genes
236 | gse <- gse[keep,]
237 | ```
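
Renaming the design columns by hand relies on the columns appearing in a known order, which follows the sorted group names and can depend on the locale. One way to avoid this is to fix the factor levels explicitly and derive the column names from them. The sketch below uses made-up group labels for 12 samples (two replicates per group is an assumption for illustration).

```r
# Made-up group labels for 12 samples, two replicates per group
group <- rep(c("Cond", "Conf", "cabazitaxeld", "cabazitaxelf",
               "docetaxeld", "docetaxelf"), each = 2)

# Fix the level order explicitly instead of relying on the locale's sort order
lv <- c("cabazitaxeld", "cabazitaxelf", "Cond", "Conf",
        "docetaxeld", "docetaxelf")
group <- factor(group, levels = lv)

design <- model.matrix(~ 0 + group)

# Derive the column names from the levels, capitalising the first letter
colnames(design) <- paste0(toupper(substring(lv, 1, 1)), substring(lv, 2))

print(colnames(design))
print(colSums(design))   # number of samples in each group
```

Because the names come from the levels themselves, a column can never end up with the label of a different group.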
238 | 
239 | Here there is a small extra step to deal with outliers. This has to be done carefully so that the results are not biased. We calculate weights that reflect the reliability of each sample. The ‘arrayWeights’ function assigns a weight to each sample, with a value of 1 implying equal weight. Samples with a weight below 1 are down-weighted, and samples with a weight above 1 are up-weighted.
240 | 
241 | 
242 | ```{R}
243 | # coping with outliers
244 | ## calculate relative array weights
245 | aw <- arrayWeights(exprs(gse),design)
246 | aw
247 | ```
248 | 
249 | Now that we have a design matrix, we need to estimate the coefficients. For this design, this essentially averages the replicate arrays within each sample group. In addition, limma calculates a standard deviation for each gene and the average intensity of each gene across all microarrays.
250 | 
251 | We are now ready to tell limma which pairwise contrasts we want to make. For this experiment, we contrast each treatment (there are two taxane drugs) with the control within each serum type, so there are 4 contrasts to specify.
252 | 
253 | To do the statistical comparisons, limma uses an empirical Bayes approach that moderates the gene-wise variance estimates; the ‘eBayes’ function performs the tests. To summarise the results, 'topTable' adjusts the p-values and returns the top genes that meet the cut-offs you supply as arguments, while 'decideTests' calls DEGs by adjusting the p-values and, optionally, applying a logFC cut-off, similar to topTable.
254 | 
255 | ```{R}
256 | ## Fitting the coefficients
257 | fit <- lmFit(exprs(gse), design,
258 |              weights = aw)
259 | 
260 | head(fit$coefficients)
261 | 
262 | ## Making comparisons between samples, can define multiple contrasts
263 | contrasts <- makeContrasts(Docetaxeld - Cond, Cabazitaxeld - Cond, Docetaxelf - Conf, Cabazitaxelf - Conf, levels = design)
264 | 
265 | fit2 <- contrasts.fit(fit, contrasts)
266 | fit2 <- eBayes(fit2)
267 | 
268 | 
269 | topTable(fit2)
270 | topTable1 <- topTable(fit2, coef=1)
271 | topTable2 <- topTable(fit2, coef=2)
272 | topTable3 <- topTable(fit2, coef=3)
273 | topTable4 <- topTable(fit2, coef=4)
274 | 
275 | # To see how many genes are differentially expressed overall, we can use the decideTests function.
276 | summary(decideTests(fit2))
277 | table(decideTests(fit2))
278 | ```
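
To see what a contrast means for a single gene, the fit can be reproduced with base R. This is an illustrative sketch with made-up log2 values, unweighted and without the empirical Bayes step, so it shows only how a coefficient and a log fold change arise, not what limma does in full.

```r
# Made-up log2 expression values of one gene: two groups, three replicates each
group <- factor(rep(c("Cond", "Docetaxeld"), each = 3))
y <- c(5.1, 4.9, 5.0, 7.2, 6.8, 7.0)

design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# With a no-intercept design, each coefficient is the mean of its group
coefs <- lm.fit(design, y)$coefficients

# The contrast "Docetaxeld - Cond" is a weighted sum of the coefficients
contrast <- c(Cond = -1, Docetaxeld = 1)
logFC <- sum(contrast * coefs[names(contrast)])

print(coefs)
print(logFC)
```

When array weights are supplied, the coefficients become weighted means, so the values would differ slightly from the plain group averages shown here.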
279 | 
280 | # Further visualization with gene annotation
281 | 
282 | Now we want to know the gene associated with each probe ID. The annotation data can be retrieved with the ‘fData’ function. Let’s select the columns ID and GB_ACC (the GenBank accession ID) and add them to the fit2 object.
283 | 
284 | The volcano plot is a common way of visualising the results of a DE analysis. The x axis shows the log fold change and the y axis shows some measure of statistical significance, which in this case is the log-odds, or “B” statistic. We can also colour the genes that pass both cut-offs: a p-value below 0.05 and an absolute log fold change above 1.
285 | 
286 | ```{R}
287 | 
288 | anno <- fData(gse)
289 | head(anno)
290 | 
291 | anno <- select(anno,ID,GB_ACC)
292 | fit2$genes <- anno
293 | 
294 | topTable(fit2)
295 | 
296 | ## Create volcano plot
297 | full_results1 <- topTable(fit2, coef=1, number=Inf)
298 | library(ggplot2)
299 | ggplot(full_results1,aes(x = logFC, y=B)) + geom_point()
300 | 
301 | ## change according to your needs
302 | p_cutoff <- 0.05
303 | fc_cutoff <- 1
304 | 
305 | 
306 | full_results1 %>% 
307 |   mutate(Significant = P.Value < p_cutoff & abs(logFC) > fc_cutoff) %>% 
308 |   ggplot(aes(x = logFC, y = B, col=Significant)) + geom_point()
309 | ```
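
The colouring rule combines two conditions, and a probe should be flagged only when both hold. The sketch below shows this on a small made-up results table in base R; the probe names and values are invented for illustration, and only the column names `logFC` and `P.Value` mirror the topTable output.

```r
# Made-up results table with the two topTable columns used for the flag
res <- data.frame(
  logFC   = c(2.0, 0.3, -1.5, -2.2),
  P.Value = c(0.001, 0.0004, 0.2, 0.03),
  row.names = paste0("probe", 1:4)
)

p_cutoff  <- 0.05
fc_cutoff <- 1

# A probe is flagged only if it passes both cut-offs, hence the `&`
res$Significant <- res$P.Value < p_cutoff & abs(res$logFC) > fc_cutoff

print(res)
print(table(res$Significant))
```

In this toy table, the second probe has a very small p-value but a small fold change, and the third has a large fold change but a large p-value, so neither is flagged.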
310 | 
311 | # Further visualization of selected gene
312 | 
313 | At this point, the structure of GSE data should be quite clear. It consists of the sample data (pData), the expression data (exprs), and the annotation data (fData). We have also learned how to check the expression data, transform them, and perform differential expression analysis.
314 | 
315 | With the tables of differentially expressed genes, there are several downstream analyses that we can continue with, such as exporting a full table of DE genes, generating a heatmap for selected genes, getting the gene list for a particular pathway, or performing survival analysis (but only for clinical data).
316 | 
317 | Here, I just want to look at the fold change of a selected gene and whether it is significantly differentially expressed.
318 | 
319 | ```{R}
320 | 
321 | ## Get the results for particular gene of interest
322 | #GB_ACC for Nkx3-1 is NM_001256339 or NM_006167
323 | ##no NM_001256339 in this data
324 | full_results2 <- topTable(fit2, coef=2, number=Inf)
325 | full_results3 <- topTable(fit2, coef=3, number=Inf)
326 | full_results4 <- topTable(fit2, coef=4, number=Inf)
327 | filter(full_results1, GB_ACC == "NM_006167")
328 | filter(full_results2, GB_ACC == "NM_006167")
329 | filter(full_results3, GB_ACC == "NM_006167")
330 | filter(full_results4, GB_ACC == "NM_006167")
331 | ```
332 | 
333 | That’s all for the walk-through. Thanks for reading; I hope you have learned something new here.
334 | 
335 | # Acknowledgement
336 | Many thanks to the following tutorials made publicly available:
337 | 
338 | 1. Introduction to microarray analysis GSE15947, by Department of Statistics, Purdue University https://www.stat.purdue.edu/bigtap/online/docs/Introduction_to_Microarray_Analysis_GSE15947.html
339 | 
340 | 2. Mark Dunning, 2020, GEO tutorial, by Sheffield Bioinformatics Core https://sbc.shef.ac.uk/geo_tutorial/tutorial.nb.html#Further_processing_and_visualisation_of_DE_results 
341 | 
342 | 


--------------------------------------------------------------------------------
/Info.txt:
--------------------------------------------------------------------------------
1 | # How-to-analyze-GEO-microarray-data
2 | GSE analysis for microarray data, for the tutorial shown in https://www.youtube.com/watch?v=JQ24T9fpXvg&t=947s
3 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
   1 | -----
   2 | 
   3 | # Introduction
   4 | 
   5 | This R Markdown document demonstrates how to download GSE data from the
   6 | NCBI GEO database and how to identify differentially expressed genes
   7 | from it.
   8 | 
   9 | ## A little background of GEO
  10 | 
  11 | The Gene Expression Omnibus (GEO) is a data repository hosted
  12 | by the National Center for Biotechnology Information (NCBI).
  13 | NCBI also hosts GenBank, a collection of publicly available
  14 | nucleotide sequences; its records come from direct submissions
  15 | by the original authors, who volunteer them to make the data
  16 | publicly available or do so as part of the publication process.
  17 | GEO itself is intended to house different types of expression
  18 | data, covering microarray and sequencing data in both raw and
  19 | processed formats.
  20 | 
  21 | ## Example here
  22 | 
  23 | Here we analyse GSE63477, a microarray dataset. It contains expression
  24 | profiles of prostate cancer cells (LNCaP) after treatment with
  25 | cabazitaxel or docetaxel for 16 hr. You can read the details on NCBI
  26 | GEO under the “Overall design” section; for a better understanding, it
  27 | is always good to read the original paper as well. Link:
  28 | <https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi>
  29 | 
  30 | -----
  31 | 
  32 | # Using GEOquery to obtain microarray data
  33 | 
  34 | First, set the working directory.
  35 | 
  36 | The GEOquery package allows you to access the data from GEO. Depending
  37 | on your needs, you can download only the processed data and metadata
  38 | provided by the depositor. In some cases, you may want to download the
  39 | raw data as well, if it was provided by the depositor.
  40 | 
  41 | The function to download a GEO dataset is ‘getGEO’ from the ‘GEOquery’
  42 | package. It returns a list with one element per platform, so check how
  43 | many platforms were used for the GSE data; usually there is only one.
  44 | We take the first element of the list, so gse is now an ExpressionSet.
  45 | You can see that it contains assayData, phenoData, featureData, etc.
  46 | 
  47 | We can have a look at the sample information, gene annotation, and
  48 | expression data. This gives us a rough idea of the information stored
  49 | in this ExpressionSet.
  50 | 
  51 | ``` r
  52 | getwd()
  53 | ```
  54 | 
  55 |     ## [1] "C:/Users/Lynn/Documents/R_GEOdata"
  56 | 
  57 | ``` r
  58 | setwd("C:/Users/Lynn/Documents/R_GEOdata")
  59 | 
  60 | ###https://sbc.shef.ac.uk/geo_tutorial/tutorial.nb.html#Introduction
  61 | #----import the data------------------------------------
  62 | library(GEOquery)
  63 | ```
  64 | 
  65 |     ## Loading required package: Biobase
  66 | 
  67 |     ## Loading required package: BiocGenerics
  68 | 
  69 |     ## Loading required package: parallel
  70 | 
  71 |     ## 
  72 |     ## Attaching package: 'BiocGenerics'
  73 | 
  74 |     ## The following objects are masked from 'package:parallel':
  75 |     ## 
  76 |     ##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
  77 |     ##     clusterExport, clusterMap, parApply, parCapply, parLapply,
  78 |     ##     parLapplyLB, parRapply, parSapply, parSapplyLB
  79 | 
  80 |     ## The following objects are masked from 'package:stats':
  81 |     ## 
  82 |     ##     IQR, mad, sd, var, xtabs
  83 | 
  84 |     ## The following objects are masked from 'package:base':
  85 |     ## 
  86 |     ##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
  87 |     ##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
  88 |     ##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
  89 |     ##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
  90 |     ##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
  91 |     ##     union, unique, unsplit, which.max, which.min
  92 | 
  93 |     ## Welcome to Bioconductor
  94 |     ## 
  95 |     ##     Vignettes contain introductory material; view with
  96 |     ##     'browseVignettes()'. To cite Bioconductor, see
  97 |     ##     'citation("Biobase")', and for packages 'citation("pkgname")'.
  98 | 
  99 |     ## Setting options('download.file.method.GEOquery'='auto')
 100 | 
 101 |     ## Setting options('GEOquery.inmemory.gpl'=FALSE)
 102 | 
 103 | ``` r
 104 | my_id <- "GSE63477"
 105 | gse <- getGEO(my_id)
 106 | ```
 107 | 
 108 |     ## Found 1 file(s)
 109 | 
 110 |     ## GSE63477_series_matrix.txt.gz
 111 | 
 112 |     ## Parsed with column specification:
 113 |     ## cols(
 114 |     ##   ID_REF = col_double(),
 115 |     ##   GSM1550559 = col_double(),
 116 |     ##   GSM1550560 = col_double(),
 117 |     ##   GSM1550561 = col_double(),
 118 |     ##   GSM1550562 = col_double(),
 119 |     ##   GSM1550563 = col_double(),
 120 |     ##   GSM1550564 = col_double(),
 121 |     ##   GSM1550565 = col_double(),
 122 |     ##   GSM1550566 = col_double(),
 123 |     ##   GSM1550567 = col_double(),
 124 |     ##   GSM1550568 = col_double(),
 125 |     ##   GSM1550569 = col_double(),
 126 |     ##   GSM1550570 = col_double()
 127 |     ## )
 128 | 
 129 |     ## File stored at:
 130 | 
 131 |     ## C:\Users\Lynn\AppData\Local\Temp\RtmpcbmFsU/GPL16686.soft
 132 | 
 133 |     ## Warning: 190 parsing failures.
 134 |     ##   row col expected         actual         file
 135 |     ## 53792  ID a double AFFX-BioB-3_at literal data
 136 |     ## 53793  ID a double AFFX-BioB-3_st literal data
 137 |     ## 53794  ID a double AFFX-BioB-5_at literal data
 138 |     ## 53795  ID a double AFFX-BioB-5_st literal data
 139 |     ## 53796  ID a double AFFX-BioB-M_at literal data
 140 |     ## ..... ... ........ .............. ............
 141 |     ## See problems(...) for more details.
 142 | 
 143 | ``` r
 144 | ## check how many platforms were used
 145 | length(gse)
 146 | ```
 147 | 
 148 |     ## [1] 1
 149 | 
 150 | ``` r
 151 | gse <- gse[[1]]
 152 | gse
 153 | ```
 154 | 
 155 |     ## ExpressionSet (storageMode: lockedEnvironment)
 156 |     ## assayData: 44629 features, 12 samples 
 157 |     ##   element names: exprs 
 158 |     ## protocolData: none
 159 |     ## phenoData
 160 |     ##   sampleNames: GSM1550559 GSM1550560 ... GSM1550570 (12 total)
 161 |     ##   varLabels: title geo_accession ... treatment:ch1 (35 total)
 162 |     ##   varMetadata: labelDescription
 163 |     ## featureData
 164 |     ##   featureNames: 16657436 16657440 ... 17118478 (44629 total)
 165 |     ##   fvarLabels: ID RANGE_STRAND ... RANGE_GB (8 total)
 166 |     ##   fvarMetadata: Column Description labelDescription
 167 |     ## experimentData: use 'experimentData(object)'
 168 |     ## Annotation: GPL16686
 169 | 
 170 | ``` r
 171 | pData(gse)[1:2,] ## print the sample information
 172 | ```
 173 | 
 174 |     ##                                                                        title
 175 |     ## GSM1550559 LNCaP CTRL treated in charcoal dextran treated serum, replicate 1
 176 |     ## GSM1550560 LNCaP CTRL treated in charcoal dextran treated serum, replicate 2
 177 |     ##            geo_accession                status submission_date last_update_date
 178 |     ## GSM1550559    GSM1550559 Public on Jan 01 2015     Nov 19 2014      Jan 01 2015
 179 |     ## GSM1550560    GSM1550560 Public on Jan 01 2015     Nov 19 2014      Jan 01 2015
 180 |     ##            type channel_count source_name_ch1 organism_ch1 characteristics_ch1
 181 |     ## GSM1550559  RNA             1           LNCaP Homo sapiens    cell line: LNCaP
 182 |     ## GSM1550560  RNA             1           LNCaP Homo sapiens    cell line: LNCaP
 183 |     ##                                                characteristics_ch1.1
 184 |     ## GSM1550559 treatment: CTRL treated in charcoal dextran treated serum
 185 |     ## GSM1550560 treatment: CTRL treated in charcoal dextran treated serum
 186 |     ##                                                      treatment_protocol_ch1
 187 |     ## GSM1550559 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH)
 188 |     ## GSM1550560 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH)
 189 |     ##                            growth_protocol_ch1 molecule_ch1
 190 |     ## GSM1550559 Cells treated at ca. 80% confluency    total RNA
 191 |     ## GSM1550560 Cells treated at ca. 80% confluency    total RNA
 192 |     ##                                                                                                                                              extract_protocol_ch1
 193 |     ## GSM1550559 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc)
 194 |     ## GSM1550560 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc)
 195 |     ##            label_ch1
 196 |     ## GSM1550559    biotin
 197 |     ## GSM1550560    biotin
 198 |     ##                                                                                                                      label_protocol_ch1
 199 |     ## GSM1550559 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen Technologies, Inc)
 200 |     ## GSM1550560 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen Technologies, Inc)
 201 |     ##            taxid_ch1
 202 |     ## GSM1550559      9606
 203 |     ## GSM1550560      9606
 204 |     ##                                                                                                                                                                                                                                                hyb_protocol
 205 |     ## GSM1550559 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h.
 206 |     ## GSM1550560 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h.
 207 |     ##                                   scan_protocol
 208 |     ## GSM1550559 Affymetrix Gene ChIP Scanner 3000 7G
 209 |     ## GSM1550560 Affymetrix Gene ChIP Scanner 3000 7G
 210 |     ##                                                                                                                                    data_processing
 211 |     ## GSM1550559 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes.
 212 |     ## GSM1550560 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes.
 213 |     ##            data_processing.1 data_processing.2 platform_id   contact_name
 214 |     ## GSM1550559 HuGene-2_0-st.pgf HuGene-2_0-st.mps    GPL16686 Karen,,Knudsen
 215 |     ## GSM1550560 HuGene-2_0-st.pgf HuGene-2_0-st.mps    GPL16686 Karen,,Knudsen
 216 |     ##                                             contact_institute
 217 |     ## GSM1550559 Thomas Jefferson University - Kimmel Cancer Center
 218 |     ## GSM1550560 Thomas Jefferson University - Kimmel Cancer Center
 219 |     ##                     contact_address contact_city contact_state
 220 |     ## GSM1550559 233 S 10th St, BLSB 1008 Philadelphia  Pennsylvania
 221 |     ## GSM1550560 233 S 10th St, BLSB 1008 Philadelphia  Pennsylvania
 222 |     ##            contact_zip/postal_code contact_country
 223 |     ## GSM1550559                   19107             USA
 224 |     ## GSM1550560                   19107             USA
 225 |     ##                                                                                               supplementary_file
 226 |     ## GSM1550559 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550559/suppl/GSM1550559_01_LN_CDT-CTRL_1.CEL.gz
 227 |     ## GSM1550560 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550560/suppl/GSM1550560_01_LN_CTS-CTRL_2.CEL.gz
 228 |     ##            data_row_count cell line:ch1
 229 |     ## GSM1550559          44629         LNCaP
 230 |     ## GSM1550560          44629         LNCaP
 231 |     ##                                             treatment:ch1
 232 |     ## GSM1550559 CTRL treated in charcoal dextran treated serum
 233 |     ## GSM1550560 CTRL treated in charcoal dextran treated serum
 234 | 
 235 | ``` r
 236 | fData(gse)[1,] ## print the gene annotation
 237 | ```
 238 | 
 239 |     ##                ID RANGE_STRAND RANGE_START RANGE_END total_probes    GB_ACC
 240 |     ## 16657436 16657436            +       12190     13639           25 NR_046018
 241 |     ##                   SPOT_ID     RANGE_GB
 242 |     ## 16657436 chr1:12190-13639 NC_000001.10
 243 | 
 244 | ``` r
 245 | exprs(gse)[1,] ## print the expression data
 246 | ```
 247 | 
 248 |     ## GSM1550559 GSM1550560 GSM1550561 GSM1550562 GSM1550563 GSM1550564 GSM1550565 
 249 |     ##   24.63215   21.96198   24.36674   24.28032   24.82574   22.72258   25.76430 
 250 |     ## GSM1550566 GSM1550567 GSM1550568 GSM1550569 GSM1550570 
 251 |     ##   23.03947   24.45452   23.74868   23.77131   21.67176
 252 | 
 253 | # Check the normalisation and scales used
 254 | 
 255 | The first command below prints the ‘data\_processing’ field, so we can
 256 | see if the data have already been processed. Here, the expression data
 257 | were RMA normalized and filtered to remove low-expressing genes. RMA
 258 | stands for Robust Multi-array Average, a widely used method to summarise
 259 | probeset expression levels for Affymetrix arrays.
 260 | 
 261 | The ‘summary’ function can then be used to print the distribution of
 262 | expression levels. Data that have been log2 transformed typically fall
 263 | in the range of 0 to 16. Hmm…the values here are quite big and go well
 264 | beyond 16. It’s quite weird, because RMA should already log2 transform
 265 | the data at the last step, but well, let’s do it on our own and move to
 266 | the next step. For a more careful analysis, we could apply RMA
 267 | normalization to the raw data of this dataset ourselves, to see if
 268 | there is any difference.
 269 | 
 270 | Anyway, here, let’s perform a log2 transformation, then check the
 271 | summary of the expression levels again and draw a boxplot. We can see
 272 | that the distributions of the samples are highly similar, which means
 273 | the data have been normalised.
 274 | 
 275 | ``` r
 276 | pData(gse)$data_processing[1]
 277 | ```
 278 | 
 279 |     ## [1] "Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes."
 280 | 
 281 | ``` r
 282 | # For visualisation and statistical analysis, we will inspect the data to 
 283 | # discover what scale the data are presented in. The methods we will use assume 
 284 | # the data are on a log2 scale; typically in the range of 0 to 16.
 285 | 
 286 | ## have a look at the expression values
 287 | summary(exprs(gse))
 288 | ```
 289 | 
 290 |     ##    GSM1550559        GSM1550560        GSM1550561        GSM1550562     
 291 |     ##  Min.   :  17.42   Min.   :  17.49   Min.   :  17.54   Min.   :  17.44  
 292 |     ##  1st Qu.:  19.59   1st Qu.:  19.53   1st Qu.:  19.59   1st Qu.:  19.45  
 293 |     ##  Median :  22.29   Median :  22.34   Median :  22.39   Median :  22.43  
 294 |     ##  Mean   :  48.09   Mean   :  48.51   Mean   :  48.08   Mean   :  48.91  
 295 |     ##  3rd Qu.:  33.45   3rd Qu.:  33.90   3rd Qu.:  33.70   3rd Qu.:  33.92  
 296 |     ##  Max.   :5889.83   Max.   :6043.74   Max.   :5907.55   Max.   :6087.95  
 297 |     ##    GSM1550563        GSM1550564        GSM1550565        GSM1550566     
 298 |     ##  Min.   :  17.48   Min.   :  17.47   Min.   :  17.43   Min.   :  17.49  
 299 |     ##  1st Qu.:  19.48   1st Qu.:  19.51   1st Qu.:  19.59   1st Qu.:  19.54  
 300 |     ##  Median :  22.19   Median :  22.22   Median :  22.38   Median :  22.29  
 301 |     ##  Mean   :  49.16   Mean   :  48.72   Mean   :  48.29   Mean   :  48.83  
 302 |     ##  3rd Qu.:  33.70   3rd Qu.:  33.84   3rd Qu.:  33.49   3rd Qu.:  33.85  
 303 |     ##  Max.   :5991.22   Max.   :5827.09   Max.   :5704.80   Max.   :5938.88  
 304 |     ##    GSM1550567        GSM1550568        GSM1550569        GSM1550570     
 305 |     ##  Min.   :  17.49   Min.   :  17.32   Min.   :  17.38   Min.   :  17.38  
 306 |     ##  1st Qu.:  19.54   1st Qu.:  19.58   1st Qu.:  19.41   1st Qu.:  19.52  
 307 |     ##  Median :  22.36   Median :  22.35   Median :  22.31   Median :  22.27  
 308 |     ##  Mean   :  48.72   Mean   :  48.31   Mean   :  49.31   Mean   :  48.94  
 309 |     ##  3rd Qu.:  33.62   3rd Qu.:  33.58   3rd Qu.:  33.82   3rd Qu.:  33.71  
 310 |     ##  Max.   :6140.01   Max.   :6066.52   Max.   :6307.27   Max.   :5844.71
 311 | 
 312 | ``` r
 313 | # From this output we clearly see that the values go beyond 16, 
 314 | # so we need to perform a log2 transformation.
 315 | exprs(gse) <- log2(exprs(gse))
 316 | 
 317 | # check again the summary
 318 | summary(exprs(gse))
 319 | ```
 320 | 
 321 |     ##    GSM1550559       GSM1550560       GSM1550561       GSM1550562    
 322 |     ##  Min.   : 4.122   Min.   : 4.128   Min.   : 4.133   Min.   : 4.124  
 323 |     ##  1st Qu.: 4.292   1st Qu.: 4.288   1st Qu.: 4.292   1st Qu.: 4.282  
 324 |     ##  Median : 4.479   Median : 4.482   Median : 4.485   Median : 4.488  
 325 |     ##  Mean   : 4.887   Mean   : 4.892   Mean   : 4.887   Mean   : 4.890  
 326 |     ##  3rd Qu.: 5.064   3rd Qu.: 5.083   3rd Qu.: 5.075   3rd Qu.: 5.084  
 327 |     ##  Max.   :12.524   Max.   :12.561   Max.   :12.528   Max.   :12.572  
 328 |     ##    GSM1550563       GSM1550564       GSM1550565       GSM1550566    
 329 |     ##  Min.   : 4.128   Min.   : 4.127   Min.   : 4.123   Min.   : 4.128  
 330 |     ##  1st Qu.: 4.284   1st Qu.: 4.286   1st Qu.: 4.292   1st Qu.: 4.288  
 331 |     ##  Median : 4.472   Median : 4.474   Median : 4.484   Median : 4.478  
 332 |     ##  Mean   : 4.893   Mean   : 4.892   Mean   : 4.887   Mean   : 4.892  
 333 |     ##  3rd Qu.: 5.075   3rd Qu.: 5.081   3rd Qu.: 5.066   3rd Qu.: 5.081  
 334 |     ##  Max.   :12.549   Max.   :12.509   Max.   :12.478   Max.   :12.536  
 335 |     ##    GSM1550567       GSM1550568       GSM1550569       GSM1550570    
 336 |     ##  Min.   : 4.128   Min.   : 4.114   Min.   : 4.119   Min.   : 4.120  
 337 |     ##  1st Qu.: 4.288   1st Qu.: 4.291   1st Qu.: 4.279   1st Qu.: 4.287  
 338 |     ##  Median : 4.483   Median : 4.482   Median : 4.479   Median : 4.477  
 339 |     ##  Mean   : 4.891   Mean   : 4.888   Mean   : 4.891   Mean   : 4.892  
 340 |     ##  3rd Qu.: 5.071   3rd Qu.: 5.070   3rd Qu.: 5.080   3rd Qu.: 5.075  
 341 |     ##  Max.   :12.584   Max.   :12.567   Max.   :12.623   Max.   :12.513
 342 | 
 343 | ``` r
 344 | boxplot(exprs(gse), outline = FALSE)
 345 | ```
 346 | 
 347 | ![](GSE_analysis_microarray_files/figure-gfm/unnamed-chunk-2-1.png)<!-- -->
 348 | 
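Eyeballing the summary works, but the same check can be written as a small helper. Below is a minimal sketch, not part of the original analysis: `looks_linear` is a hypothetical helper name, the matrix is simulated rather than taken from GSE63477, and the cut-off of 100 on the 99th percentile is only a rule-of-thumb assumption, not a formal test.

```r
# Simulated linear-scale expression matrix (made-up values, 1000 probes x 4 samples)
set.seed(1)
expr_linear <- matrix(2^rnorm(4000, mean = 5, sd = 1.5), ncol = 4,
                      dimnames = list(NULL, paste0("S", 1:4)))

# Rule of thumb (an assumption): log2 microarray values rarely exceed ~16,
# so a very large 99th percentile suggests the data are still on a linear scale.
looks_linear <- function(m, threshold = 100) {
  q99 <- quantile(m, 0.99, na.rm = TRUE)
  unname(q99 > threshold)
}

looks_linear(expr_linear)

# Transform only when the check says the data are not yet on a log scale
expr_log2 <- if (looks_linear(expr_linear)) log2(expr_linear) else expr_linear

looks_linear(expr_log2)
range(expr_log2)
```

With a real dataset you would pass `exprs(gse)` instead of the simulated matrix, and apply `log2` only when the check returns `TRUE`.
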
 349 | # Inspect the clinical variables
 350 | 
 351 | Now let’s look into the pData for the elements that we need for the
 352 | analysis. We want to know, for each sample, whether it is a treatment or
 353 | a control. In this dataset, that info is stored in the column
 354 | ‘characteristics\_ch1.1’.
 355 | 
 356 | We can use the ‘select’ function to subset the column of interest. It is
 357 | also useful to rename the column to something shorter and easier to
 358 | type.
 359 | 
 360 | To make a column of simplified group names for each sample, ‘stringr’ is
 361 | helpful. Two new columns are created, named “group” and “serum”. The
 362 | function ‘str\_detect’ detects the presence of a keyword in each sample
 363 | description, and the row is then filled in accordingly. The categories
 364 | you need depend entirely on your dataset, so just modify these commands
 365 | for your dataset of interest.
 366 | 
 367 | ``` r
 368 | library(dplyr)
 369 | ```
 370 | 
 371 |     ## 
 372 |     ## Attaching package: 'dplyr'
 373 | 
 374 |     ## The following object is masked from 'package:Biobase':
 375 |     ## 
 376 |     ##     combine
 377 | 
 378 |     ## The following objects are masked from 'package:BiocGenerics':
 379 |     ## 
 380 |     ##     combine, intersect, setdiff, union
 381 | 
 382 |     ## The following objects are masked from 'package:stats':
 383 |     ## 
 384 |     ##     filter, lag
 385 | 
 386 |     ## The following objects are masked from 'package:base':
 387 |     ## 
 388 |     ##     intersect, setdiff, setequal, union
 389 | 
 390 | ``` r
 391 | sampleInfo <- pData(gse)
 392 | head(sampleInfo)
 393 | ```
 394 | 
 395 |     ##                                                                                         title
 396 |     ## GSM1550559                  LNCaP CTRL treated in charcoal dextran treated serum, replicate 1
 397 |     ## GSM1550560                  LNCaP CTRL treated in charcoal dextran treated serum, replicate 2
 398 |     ## GSM1550561 LNCaP 16h cabazitaxel (1nM) treated in charcoal dextran treated serum, replicate 1
 399 |     ## GSM1550562 LNCaP 16h cabazitaxel (1nM) treated in charcoal dextran treated serum, replicate 2
 400 |     ## GSM1550563   LNCaP 16h docetaxel (1nM) treated in charcoal dextran treated serum, replicate 1
 401 |     ## GSM1550564   LNCaP 16h docetaxel (1nM) treated in charcoal dextran treated serum, replicate 2
 402 |     ##            geo_accession                status submission_date last_update_date
 403 |     ## GSM1550559    GSM1550559 Public on Jan 01 2015     Nov 19 2014      Jan 01 2015
 404 |     ## GSM1550560    GSM1550560 Public on Jan 01 2015     Nov 19 2014      Jan 01 2015
 405 |     ## GSM1550561    GSM1550561 Public on Jan 01 2015     Nov 19 2014      Jan 01 2015
 406 |     ## GSM1550562    GSM1550562 Public on Jan 01 2015     Nov 19 2014      Jan 01 2015
 407 |     ## GSM1550563    GSM1550563 Public on Jan 01 2015     Nov 19 2014      Jan 01 2015
 408 |     ## GSM1550564    GSM1550564 Public on Jan 01 2015     Nov 19 2014      Jan 01 2015
 409 |     ##            type channel_count source_name_ch1 organism_ch1 characteristics_ch1
 410 |     ## GSM1550559  RNA             1           LNCaP Homo sapiens    cell line: LNCaP
 411 |     ## GSM1550560  RNA             1           LNCaP Homo sapiens    cell line: LNCaP
 412 |     ## GSM1550561  RNA             1           LNCaP Homo sapiens    cell line: LNCaP
 413 |     ## GSM1550562  RNA             1           LNCaP Homo sapiens    cell line: LNCaP
 414 |     ## GSM1550563  RNA             1           LNCaP Homo sapiens    cell line: LNCaP
 415 |     ## GSM1550564  RNA             1           LNCaP Homo sapiens    cell line: LNCaP
 416 |     ##                                                                 characteristics_ch1.1
 417 |     ## GSM1550559                  treatment: CTRL treated in charcoal dextran treated serum
 418 |     ## GSM1550560                  treatment: CTRL treated in charcoal dextran treated serum
 419 |     ## GSM1550561 treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum
 420 |     ## GSM1550562 treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum
 421 |     ## GSM1550563   treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum
 422 |     ## GSM1550564   treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum
 423 |     ##                                                      treatment_protocol_ch1
 424 |     ## GSM1550559 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH)
 425 |     ## GSM1550560 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH)
 426 |     ## GSM1550561 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH)
 427 |     ## GSM1550562 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH)
 428 |     ## GSM1550563 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH)
 429 |     ## GSM1550564 16h treatment with 1nM cabazitaxel, docetaxel, or control (EtOH)
 430 |     ##                            growth_protocol_ch1 molecule_ch1
 431 |     ## GSM1550559 Cells treated at ca. 80% confluency    total RNA
 432 |     ## GSM1550560 Cells treated at ca. 80% confluency    total RNA
 433 |     ## GSM1550561 Cells treated at ca. 80% confluency    total RNA
 434 |     ## GSM1550562 Cells treated at ca. 80% confluency    total RNA
 435 |     ## GSM1550563 Cells treated at ca. 80% confluency    total RNA
 436 |     ## GSM1550564 Cells treated at ca. 80% confluency    total RNA
 437 |     ##                                                                                                                                              extract_protocol_ch1
 438 |     ## GSM1550559 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc)
 439 |     ## GSM1550560 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc)
 440 |     ## GSM1550561 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc)
 441 |     ## GSM1550562 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc)
 442 |     ## GSM1550563 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc)
 443 |     ## GSM1550564 Standard Trizol RNA extraction, followed by cDNA amplification using the Ovation Pico WTA-system V2 RNA amplification system (NuGen Technologies, Inc)
 444 |     ##            label_ch1
 445 |     ## GSM1550559    biotin
 446 |     ## GSM1550560    biotin
 447 |     ## GSM1550561    biotin
 448 |     ## GSM1550562    biotin
 449 |     ## GSM1550563    biotin
 450 |     ## GSM1550564    biotin
 451 |     ##                                                                                                                      label_protocol_ch1
 452 |     ## GSM1550559 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen Technologies, Inc)
 453 |     ## GSM1550560 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen Technologies, Inc)
 454 |     ## GSM1550561 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen Technologies, Inc)
 455 |     ## GSM1550562 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen Technologies, Inc)
 456 |     ## GSM1550563 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen Technologies, Inc)
 457 |     ## GSM1550564 5ug cDNA was fragmanted and chemically labeled with biotin using the FL-Ovation cDNA biotin module (NuGen Technologies, Inc)
 458 |     ##            taxid_ch1
 459 |     ## GSM1550559      9606
 460 |     ## GSM1550560      9606
 461 |     ## GSM1550561      9606
 462 |     ## GSM1550562      9606
 463 |     ## GSM1550563      9606
 464 |     ## GSM1550564      9606
 465 |     ##                                                                                                                                                                                                                                                hyb_protocol
 466 |     ## GSM1550559 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h.
 467 |     ## GSM1550560 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h.
 468 |     ## GSM1550561 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h.
 469 |     ## GSM1550562 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h.
 470 |     ## GSM1550563 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h.
 471 |     ## GSM1550564 5ug cDNA in 220ul hybridization cocktail was hybridized on the Affymetrix Human Gene 2.0 ST Array in a GeneChip Hybridization Oven 645. Target denaturation was performed at 99C for 2 minutes, 45C for 5min, followed by hybridization for 18h.
 472 |     ##                                   scan_protocol
 473 |     ## GSM1550559 Affymetrix Gene ChIP Scanner 3000 7G
 474 |     ## GSM1550560 Affymetrix Gene ChIP Scanner 3000 7G
 475 |     ## GSM1550561 Affymetrix Gene ChIP Scanner 3000 7G
 476 |     ## GSM1550562 Affymetrix Gene ChIP Scanner 3000 7G
 477 |     ## GSM1550563 Affymetrix Gene ChIP Scanner 3000 7G
 478 |     ## GSM1550564 Affymetrix Gene ChIP Scanner 3000 7G
 479 |     ##                                                                                                                                    data_processing
 480 |     ## GSM1550559 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes.
 481 |     ## GSM1550560 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes.
 482 |     ## GSM1550561 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes.
 483 |     ## GSM1550562 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes.
 484 |     ## GSM1550563 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes.
 485 |     ## GSM1550564 Data were processed with GenSpring 11.5 software. The expression data were RMA normalized, and filtered to remove low-expressing genes.
 486 |     ##            data_processing.1 data_processing.2 platform_id   contact_name
 487 |     ## GSM1550559 HuGene-2_0-st.pgf HuGene-2_0-st.mps    GPL16686 Karen,,Knudsen
 488 |     ## GSM1550560 HuGene-2_0-st.pgf HuGene-2_0-st.mps    GPL16686 Karen,,Knudsen
 489 |     ## GSM1550561 HuGene-2_0-st.pgf HuGene-2_0-st.mps    GPL16686 Karen,,Knudsen
 490 |     ## GSM1550562 HuGene-2_0-st.pgf HuGene-2_0-st.mps    GPL16686 Karen,,Knudsen
 491 |     ## GSM1550563 HuGene-2_0-st.pgf HuGene-2_0-st.mps    GPL16686 Karen,,Knudsen
 492 |     ## GSM1550564 HuGene-2_0-st.pgf HuGene-2_0-st.mps    GPL16686 Karen,,Knudsen
 493 |     ##                                             contact_institute
 494 |     ## GSM1550559 Thomas Jefferson University - Kimmel Cancer Center
 495 |     ## GSM1550560 Thomas Jefferson University - Kimmel Cancer Center
 496 |     ## GSM1550561 Thomas Jefferson University - Kimmel Cancer Center
 497 |     ## GSM1550562 Thomas Jefferson University - Kimmel Cancer Center
 498 |     ## GSM1550563 Thomas Jefferson University - Kimmel Cancer Center
 499 |     ## GSM1550564 Thomas Jefferson University - Kimmel Cancer Center
 500 |     ##                     contact_address contact_city contact_state
 501 |     ## GSM1550559 233 S 10th St, BLSB 1008 Philadelphia  Pennsylvania
 502 |     ## GSM1550560 233 S 10th St, BLSB 1008 Philadelphia  Pennsylvania
 503 |     ## GSM1550561 233 S 10th St, BLSB 1008 Philadelphia  Pennsylvania
 504 |     ## GSM1550562 233 S 10th St, BLSB 1008 Philadelphia  Pennsylvania
 505 |     ## GSM1550563 233 S 10th St, BLSB 1008 Philadelphia  Pennsylvania
 506 |     ## GSM1550564 233 S 10th St, BLSB 1008 Philadelphia  Pennsylvania
 507 |     ##            contact_zip/postal_code contact_country
 508 |     ## GSM1550559                   19107             USA
 509 |     ## GSM1550560                   19107             USA
 510 |     ## GSM1550561                   19107             USA
 511 |     ## GSM1550562                   19107             USA
 512 |     ## GSM1550563                   19107             USA
 513 |     ## GSM1550564                   19107             USA
 514 |     ##                                                                                               supplementary_file
 515 |     ## GSM1550559 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550559/suppl/GSM1550559_01_LN_CDT-CTRL_1.CEL.gz
 516 |     ## GSM1550560 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550560/suppl/GSM1550560_01_LN_CTS-CTRL_2.CEL.gz
 517 |     ## GSM1550561 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550561/suppl/GSM1550561_02_LN-CDT_CBTX_2.CEL.gz
 518 |     ## GSM1550562 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550562/suppl/GSM1550562_02_LN_CDT-CBTX_1.CEL.gz
 519 |     ## GSM1550563 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550563/suppl/GSM1550563_03_LN-CDT-DCTX_1.CEL.gz
 520 |     ## GSM1550564 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1550nnn/GSM1550564/suppl/GSM1550564_03_LN-CDT-DCTX_2.CEL.gz
 521 |     ##            data_row_count cell line:ch1
 522 |     ## GSM1550559          44629         LNCaP
 523 |     ## GSM1550560          44629         LNCaP
 524 |     ## GSM1550561          44629         LNCaP
 525 |     ## GSM1550562          44629         LNCaP
 526 |     ## GSM1550563          44629         LNCaP
 527 |     ## GSM1550564          44629         LNCaP
 528 |     ##                                                              treatment:ch1
 529 |     ## GSM1550559                  CTRL treated in charcoal dextran treated serum
 530 |     ## GSM1550560                  CTRL treated in charcoal dextran treated serum
 531 |     ## GSM1550561 16h cabazitaxel (1nM) treated in charcoal dextran treated serum
 532 |     ## GSM1550562 16h cabazitaxel (1nM) treated in charcoal dextran treated serum
 533 |     ## GSM1550563   16h docetaxel (1nM) treated in charcoal dextran treated serum
 534 |     ## GSM1550564   16h docetaxel (1nM) treated in charcoal dextran treated serum
 535 | 
 536 | ``` r
 537 | table(sampleInfo$characteristics_ch1.1)
 538 | ```
 539 | 
 540 |     ## 
 541 |     ## treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum 
 542 |     ##                                                                          2 
 543 |     ##                     treatment: 16h cabazitaxel (1nM) treated in full serum 
 544 |     ##                                                                          2 
 545 |     ##   treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum 
 546 |     ##                                                                          2 
 547 |     ##                       treatment: 16h docetaxel (1nM) treated in full serum 
 548 |     ##                                                                          2 
 549 |     ##                  treatment: CTRL treated in charcoal dextran treated serum 
 550 |     ##                                                                          2 
 551 |     ##                                      treatment: CTRL treated in full serum 
 552 |     ##                                                                          2
 553 | 
 554 | ``` r
 555 | #Let's pick just those columns that seem to contain factors we might 
 556 | #need for the analysis.
 557 | sampleInfo <- select(sampleInfo, characteristics_ch1.1)
 558 | 
 559 | ## Optionally, rename to more convenient column names
 560 | sampleInfo <- rename(sampleInfo, sample = characteristics_ch1.1)
 561 | 
 562 | head(sampleInfo)
 563 | ```
 564 | 
 565 |     ##                                                                                sample
 566 |     ## GSM1550559                  treatment: CTRL treated in charcoal dextran treated serum
 567 |     ## GSM1550560                  treatment: CTRL treated in charcoal dextran treated serum
 568 |     ## GSM1550561 treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum
 569 |     ## GSM1550562 treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum
 570 |     ## GSM1550563   treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum
 571 |     ## GSM1550564   treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum
 572 | 
 573 | ``` r
 574 | dim(sampleInfo)
 575 | ```
 576 | 
 577 |     ## [1] 12  1
 578 | 
 579 | ``` r
 580 | sampleInfo$sample
 581 | ```
 582 | 
 583 |     ##  [1] "treatment: CTRL treated in charcoal dextran treated serum"                 
 584 |     ##  [2] "treatment: CTRL treated in charcoal dextran treated serum"                 
 585 |     ##  [3] "treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum"
 586 |     ##  [4] "treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum"
 587 |     ##  [5] "treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum"  
 588 |     ##  [6] "treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum"  
 589 |     ##  [7] "treatment: CTRL treated in full serum"                                     
 590 |     ##  [8] "treatment: CTRL treated in full serum"                                     
 591 |     ##  [9] "treatment: 16h cabazitaxel (1nM) treated in full serum"                    
 592 |     ## [10] "treatment: 16h cabazitaxel (1nM) treated in full serum"                    
 593 |     ## [11] "treatment: 16h docetaxel (1nM) treated in full serum"                      
 594 |     ## [12] "treatment: 16h docetaxel (1nM) treated in full serum"
 595 | 
 596 | ``` r
 597 | library(stringr)
 598 | sampleInfo$group <- ""
 599 | for(i in 1:nrow(sampleInfo)){
 600 |   if(str_detect(sampleInfo$sample[i], "CTRL") && str_detect(sampleInfo$sample[i], "full"))
 601 |   {sampleInfo$group[i] <- "Conf"}
 602 |   
 603 |   if(str_detect(sampleInfo$sample[i], "CTRL") && str_detect(sampleInfo$sample[i], "dextran"))
 604 |   {sampleInfo$group[i] <- "Cond"}
 605 |   
 606 |   if(str_detect(sampleInfo$sample[i], "cabazitaxel") && str_detect(sampleInfo$sample[i], "full"))
 607 |   {sampleInfo$group[i] <- "cabazitaxelf"}
 608 |   
 609 |   if(str_detect(sampleInfo$sample[i], "cabazitaxel") && str_detect(sampleInfo$sample[i], "dextran"))
 610 |   {sampleInfo$group[i] <- "cabazitaxeld"}
 611 |   
 612 |   if(str_detect(sampleInfo$sample[i], "docetaxel") && str_detect(sampleInfo$sample[i], "full"))
 613 |   {sampleInfo$group[i] <- "docetaxelf"}
 614 |   
 615 |   if(str_detect(sampleInfo$sample[i], "docetaxel") && str_detect(sampleInfo$sample[i], "dextran"))
 616 |   {sampleInfo$group[i] <- "docetaxeld"}
 617 | }
 618 | 
 619 | sampleInfo 
 620 | ```
 621 | 
 622 |     ##                                                                                sample
 623 |     ## GSM1550559                  treatment: CTRL treated in charcoal dextran treated serum
 624 |     ## GSM1550560                  treatment: CTRL treated in charcoal dextran treated serum
 625 |     ## GSM1550561 treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum
 626 |     ## GSM1550562 treatment: 16h cabazitaxel (1nM) treated in charcoal dextran treated serum
 627 |     ## GSM1550563   treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum
 628 |     ## GSM1550564   treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum
 629 |     ## GSM1550565                                      treatment: CTRL treated in full serum
 630 |     ## GSM1550566                                      treatment: CTRL treated in full serum
 631 |     ## GSM1550567                     treatment: 16h cabazitaxel (1nM) treated in full serum
 632 |     ## GSM1550568                     treatment: 16h cabazitaxel (1nM) treated in full serum
 633 |     ## GSM1550569                       treatment: 16h docetaxel (1nM) treated in full serum
 634 |     ## GSM1550570                       treatment: 16h docetaxel (1nM) treated in full serum
 635 |     ##                   group
 636 |     ## GSM1550559         Cond
 637 |     ## GSM1550560         Cond
 638 |     ## GSM1550561 cabazitaxeld
 639 |     ## GSM1550562 cabazitaxeld
 640 |     ## GSM1550563   docetaxeld
 641 |     ## GSM1550564   docetaxeld
 642 |     ## GSM1550565         Conf
 643 |     ## GSM1550566         Conf
 644 |     ## GSM1550567 cabazitaxelf
 645 |     ## GSM1550568 cabazitaxelf
 646 |     ## GSM1550569   docetaxelf
 647 |     ## GSM1550570   docetaxelf
 648 | 
 649 | ``` r
 650 | sampleInfo$serum <- ""
 651 | for(i in 1:nrow(sampleInfo)){
 652 |   if(str_detect(sampleInfo$sample[i], "dextran"))
 653 |   {sampleInfo$serum[i] <- "dextran"}
 654 |   
 655 |   if(str_detect(sampleInfo$sample[i], "full"))
 656 |   {sampleInfo$serum[i] <- "full_serum"}
 657 |  
 658 | }
 659 | 
 660 | sampleInfo <- sampleInfo[,-1]
 661 | sampleInfo
 662 | ```
 663 | 
 664 |     ##                   group      serum
 665 |     ## GSM1550559         Cond    dextran
 666 |     ## GSM1550560         Cond    dextran
 667 |     ## GSM1550561 cabazitaxeld    dextran
 668 |     ## GSM1550562 cabazitaxeld    dextran
 669 |     ## GSM1550563   docetaxeld    dextran
 670 |     ## GSM1550564   docetaxeld    dextran
 671 |     ## GSM1550565         Conf full_serum
 672 |     ## GSM1550566         Conf full_serum
 673 |     ## GSM1550567 cabazitaxelf full_serum
 674 |     ## GSM1550568 cabazitaxelf full_serum
 675 |     ## GSM1550569   docetaxelf full_serum
 676 |     ## GSM1550570   docetaxelf full_serum
 677 | 
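The two loops above can also be written without a loop. This is a minimal sketch using base R's `grepl` and `ifelse` instead of ‘stringr’. The `sample_desc` vector is a made-up subset in the style of this dataset, and, unlike the loops, anything that does not mention "dextran" is assumed to be full serum.

```r
# Made-up sample descriptions in the style of 'characteristics_ch1.1'
sample_desc <- c(
  "treatment: CTRL treated in charcoal dextran treated serum",
  "treatment: 16h cabazitaxel (1nM) treated in full serum",
  "treatment: 16h docetaxel (1nM) treated in charcoal dextran treated serum",
  "treatment: CTRL treated in full serum"
)

# Serum type: one vectorised test instead of a loop
# (assumption: no "dextran" in the description means full serum)
serum <- ifelse(grepl("dextran", sample_desc, fixed = TRUE), "dextran", "full_serum")

# Drug: default to control, then overwrite where a drug name is found
drug <- rep("Con", length(sample_desc))
drug[grepl("cabazitaxel", sample_desc, fixed = TRUE)] <- "cabazitaxel"
drug[grepl("docetaxel", sample_desc, fixed = TRUE)] <- "docetaxel"

# Same naming scheme as above: drug name plus "d" (dextran) or "f" (full serum)
group <- paste0(drug, ifelse(serum == "dextran", "d", "f"))

data.frame(group, serum)
```

The design choice here is to derive the drug and the serum separately and then paste them together, so adding a new drug needs one extra line rather than two extra `if` blocks.
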
 678 | # Sample clustering and Principal Components Analysis
 679 | 
 680 | We can visualize the correlations between the samples by hierarchical
 681 | clustering.
 682 | 
 683 | The function ‘cor’ calculates the correlation (on a scale of -1 to 1)
 684 | in a pairwise fashion between all samples, which we then visualise on a
 685 | heatmap. There are many ways to create heatmaps in R; here I use
 686 | ‘pheatmap’, whose only required argument is a matrix of numeric values.
 687 | 
 688 | We can add more sample info onto the plot to get a better picture of the
 689 | groups and clustering. Here, we make use of the ‘sampleInfo’ data frame
 690 | that was created earlier, whose row names must match the columns of the
 691 | correlation matrix.
 692 | 
 693 | ``` r
 694 | library(pheatmap)
 695 | ## use = "complete.obs" (the original use="c" is matched to it) drops probes with missing values
 696 | 
 697 | corMatrix <- cor(exprs(gse), use = "complete.obs")
 698 | pheatmap(corMatrix)   
 699 | ```
 700 | 
 701 | ![](GSE_analysis_microarray_files/figure-gfm/unnamed-chunk-4-1.png)<!-- -->
 702 | 
 703 | ``` r
 704 | ## Print the rownames of the sample information and check it matches the correlation matrix
 705 | 
 706 | rownames(sampleInfo)
 707 | ```
 708 | 
 709 |     ##  [1] "GSM1550559" "GSM1550560" "GSM1550561" "GSM1550562" "GSM1550563"
 710 |     ##  [6] "GSM1550564" "GSM1550565" "GSM1550566" "GSM1550567" "GSM1550568"
 711 |     ## [11] "GSM1550569" "GSM1550570"
 712 | 
 713 | ``` r
 714 | colnames(corMatrix)
 715 | ```
 716 | 
 717 |     ##  [1] "GSM1550559" "GSM1550560" "GSM1550561" "GSM1550562" "GSM1550563"
 718 |     ##  [6] "GSM1550564" "GSM1550565" "GSM1550566" "GSM1550567" "GSM1550568"
 719 |     ## [11] "GSM1550569" "GSM1550570"
 720 | 
 721 | ``` r
 722 | ## If not, force the rownames to match the columns
 723 | #rownames(sampleInfo) <- colnames(corMatrix)
 724 | 
 725 | pheatmap(corMatrix, annotation_col= sampleInfo)
 726 | ```
 727 | 
 728 | ![](GSE_analysis_microarray_files/figure-gfm/unnamed-chunk-4-2.png)<!-- -->
 729 | 
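To make the clustering step explicit, here is a minimal sketch with base R's `hclust` on simulated data (the matrix and the sample names A1 to B3 are made up). It uses 1 minus the correlation as the distance between samples, which is one common choice and not necessarily what ‘pheatmap’ does with its default settings.

```r
set.seed(42)
# Simulated log2 expression: 200 probes x 6 samples in two groups (A and B)
base_a <- rnorm(200, mean = 8, sd = 2)
base_b <- base_a + rnorm(200, mean = 0, sd = 1.5)
expr <- cbind(sapply(1:3, function(i) base_a + rnorm(200, 0, 0.2)),
              sapply(1:3, function(i) base_b + rnorm(200, 0, 0.2)))
colnames(expr) <- c(paste0("A", 1:3), paste0("B", 1:3))

# Pairwise Pearson correlation between samples (values lie between -1 and 1)
corMatrix <- cor(expr, use = "complete.obs")

# Turn correlation into a distance: highly correlated samples are close together
sample_dist <- as.dist(1 - corMatrix)

# Average-linkage hierarchical clustering, then cut the tree into two clusters
hc <- hclust(sample_dist, method = "average")
clusters <- cutree(hc, k = 2)
clusters
```
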
 730 | Another way is to use principal component analysis (PCA). Note that the
 731 | data have to be transposed, so that the genes are in the columns and the
 732 | samples are in the rows: ‘prcomp’ treats each row as one observation, so
 733 | without transposing, the PCA would be run on the genes instead.
 734 | 
 735 | Let’s add labels when plotting the results. Here, we use the ‘ggplot2’
 736 | package, while the ‘ggrepel’ package is used to position the text labels
 737 | more cleverly so they can be read. We can see that the samples are
 738 | divided into two groups based on the serum treatment types.
 739 | 
 740 | ``` r
 741 | #make PCA
 742 | library(ggplot2)
 743 | library(ggrepel)
 744 | ## MAKE SURE TO TRANSPOSE THE EXPRESSION MATRIX
 745 | 
 746 | pca <- prcomp(t(exprs(gse)))
 747 | 
 748 | ## Join the PCs to the sample information
 749 | cbind(sampleInfo, pca$x) %>% 
 750 |   ggplot(aes(x = PC1, y=PC2, col=group, label=paste("",group))) + geom_point() + geom_text_repel()
 751 | ```
 752 | 
 753 | ![](GSE_analysis_microarray_files/figure-gfm/unnamed-chunk-5-1.png)<!-- -->
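
When reading a PCA plot it often helps to know how much of the total variance each component captures. A minimal sketch on simulated data (the matrix `expr_toy` and the shift applied to it are assumptions, not the real expression values) could look like this:

```r
set.seed(42)
# Simulated matrix: 200 genes (rows) x 6 samples (columns)
expr_toy <- matrix(rnorm(200 * 6), nrow = 200)
colnames(expr_toy) <- paste0("S", 1:6)
# Shift the first 50 genes in samples S1-S3 to mimic a group effect
expr_toy[1:50, 1:3] <- expr_toy[1:50, 1:3] + 3

# Transpose so that samples are rows, as in the chunk above
pca_toy <- prcomp(t(expr_toy))

# Proportion of variance explained by each principal component
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
round(100 * var_explained, 1)

# One row of scores per sample
dim(pca_toy$x)
```

The same two lines computing `var_explained` should work on the `pca` object from the analysis, and the percentages can be added to the axis labels of the plot.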
 754 | 
 755 | # Differential expression analysis
 756 | 
 757 | In this section, we use the limma package to perform differential
 758 | expression analysis. Limma stands for “Linear Models for Microarray
 759 | Data”. We need to tell limma which sample groups we want to compare; I
 760 | choose sampleInfo$group. A design matrix will be created: a matrix of 0s
 761 | and 1s, with one row for each sample and one column for each group.
 762 | 
 763 | We can rename the columns so that they are easier to read.
 764 | 
 765 | Now, let’s check whether the expression data contain any lowly expressed
 766 | genes, since these affect the quality of the DE analysis. A big problem
 767 | in statistical analyses like limma is the accumulation of type 1
 768 | statistical errors, also called false positives. One simple way to
 769 | reduce the chance of type 1 errors is to do fewer comparisons, by
 770 | filtering the data. For example, we know that not all genes are
 771 | expressed in all tissues and many genes will not be expressed in any
 772 | sample. As a result, in DE analysis, it makes sense to remove the genes
 773 | that are likely not expressed at all.
 774 | 
 775 | How one defines a gene as expressed is quite subjective. Here, I
 776 | follow the tutorial and set the cutoff at the median of the expression
 777 | values, which means that around 50% of the measurements are treated as
 778 | not expressed. A gene is kept if it is above the cutoff in more than 3
 779 | samples, matching the code below.
 780 | 
 781 | We can see that around half of the genes do not qualify as
 782 | “expressed” here, which makes sense because our cutoff is the
 783 | median value.
 784 | 
 785 | ``` r
 786 | library(limma)
 787 | ```
 788 | 
 789 |     ## 
 790 |     ## Attaching package: 'limma'
 791 | 
 792 |     ## The following object is masked from 'package:BiocGenerics':
 793 |     ## 
 794 |     ##     plotMA
 795 | 
 796 | ``` r
 797 | design <- model.matrix(~0 + sampleInfo$group)
 798 | design
 799 | ```
 800 | 
 801 |     ##    sampleInfo$groupcabazitaxeld sampleInfo$groupcabazitaxelf
 802 |     ## 1                             0                            0
 803 |     ## 2                             0                            0
 804 |     ## 3                             1                            0
 805 |     ## 4                             1                            0
 806 |     ## 5                             0                            0
 807 |     ## 6                             0                            0
 808 |     ## 7                             0                            0
 809 |     ## 8                             0                            0
 810 |     ## 9                             0                            1
 811 |     ## 10                            0                            1
 812 |     ## 11                            0                            0
 813 |     ## 12                            0                            0
 814 |     ##    sampleInfo$groupCond sampleInfo$groupConf sampleInfo$groupdocetaxeld
 815 |     ## 1                     1                    0                          0
 816 |     ## 2                     1                    0                          0
 817 |     ## 3                     0                    0                          0
 818 |     ## 4                     0                    0                          0
 819 |     ## 5                     0                    0                          1
 820 |     ## 6                     0                    0                          1
 821 |     ## 7                     0                    1                          0
 822 |     ## 8                     0                    1                          0
 823 |     ## 9                     0                    0                          0
 824 |     ## 10                    0                    0                          0
 825 |     ## 11                    0                    0                          0
 826 |     ## 12                    0                    0                          0
 827 |     ##    sampleInfo$groupdocetaxelf
 828 |     ## 1                           0
 829 |     ## 2                           0
 830 |     ## 3                           0
 831 |     ## 4                           0
 832 |     ## 5                           0
 833 |     ## 6                           0
 834 |     ## 7                           0
 835 |     ## 8                           0
 836 |     ## 9                           0
 837 |     ## 10                          0
 838 |     ## 11                          1
 839 |     ## 12                          1
 840 |     ## attr(,"assign")
 841 |     ## [1] 1 1 1 1 1 1
 842 |     ## attr(,"contrasts")
 843 |     ## attr(,"contrasts")$`sampleInfo$group`
 844 |     ## [1] "contr.treatment"
 845 | 
 846 | ``` r
 847 | ## the column names are a bit ugly, so we will rename
 848 | colnames(design) <- c("Cabazitaxeld","Cabazitaxelf","Cond","Conf","Docetaxeld","Docetaxelf")
 849 | 
 850 | design
 851 | ```
 852 | 
 853 |     ##    Cabazitaxeld Cabazitaxelf Cond Conf Docetaxeld Docetaxelf
 854 |     ## 1             0            0    1    0          0          0
 855 |     ## 2             0            0    1    0          0          0
 856 |     ## 3             1            0    0    0          0          0
 857 |     ## 4             1            0    0    0          0          0
 858 |     ## 5             0            0    0    0          1          0
 859 |     ## 6             0            0    0    0          1          0
 860 |     ## 7             0            0    0    1          0          0
 861 |     ## 8             0            0    0    1          0          0
 862 |     ## 9             0            1    0    0          0          0
 863 |     ## 10            0            1    0    0          0          0
 864 |     ## 11            0            0    0    0          0          1
 865 |     ## 12            0            0    0    0          0          1
 866 |     ## attr(,"assign")
 867 |     ## [1] 1 1 1 1 1 1
 868 |     ## attr(,"contrasts")
 869 |     ## attr(,"contrasts")$`sampleInfo$group`
 870 |     ## [1] "contr.treatment"
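
Typing the new column names by hand only works if they are listed in the same order as the factor levels. A minimal sketch of a less error-prone variant, assuming the group labels are stored as a factor (the vector `group` below is retyped from the design matrix above for illustration), takes the names directly from the levels:

```r
# Group labels in sample order, as they appear in the design matrix above
group <- factor(c("Cond", "Cond", "cabazitaxeld", "cabazitaxeld",
                  "docetaxeld", "docetaxeld", "Conf", "Conf",
                  "cabazitaxelf", "cabazitaxelf", "docetaxelf", "docetaxelf"))

# No-intercept design: one indicator column per group
design_toy <- model.matrix(~ 0 + group)

# The columns follow levels(group), so the levels can be reused as names
colnames(design_toy) <- levels(group)

# Each group should contain exactly two replicate arrays
colSums(design_toy)
```

Because the level order depends on the sorting rules of the R session, reusing `levels(group)` avoids mislabelling a column if that order ever differs from the hand-typed one.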
 871 | 
 872 | ``` r
 873 | ## calculate median expression level
 874 | cutoff <- median(exprs(gse))
 875 | 
 876 | ## TRUE or FALSE for whether each gene is "expressed" in each sample
 877 | is_expressed <- exprs(gse) > cutoff
 878 | 
 879 | ## Identify genes expressed in more than 3 samples
 880 | 
 881 | keep <- rowSums(is_expressed) > 3
 882 | 
 883 | ## check how many genes are removed / retained.
 884 | table(keep)
 885 | ```
 886 | 
 887 |     ## keep
 888 |     ## FALSE  TRUE 
 889 |     ## 20965 23664
 890 | 
 891 | ``` r
 892 | ## subset to just those expressed genes
 893 | gse <- gse[keep,]
 894 | ```
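
The filter counts, for each probe, the samples whose value exceeds the median and then applies a strict `>` comparison, so `> 3` keeps probes that are above the cutoff in at least 4 samples. A minimal sketch on a made-up matrix (the helper `filter_expressed` and the toy values are assumptions for illustration) shows how the threshold changes what is kept:

```r
# Hypothetical helper wrapping the median filter used above
filter_expressed <- function(mat, min_samples) {
  cutoff <- median(mat)                 # global median over all values
  rowSums(mat > cutoff) > min_samples   # strictly more than min_samples
}

# Toy matrix: 4 genes x 4 samples, values are either high (9) or low (1)
toy <- rbind(geneA = c(9, 9, 9, 9),
             geneB = c(9, 9, 9, 1),
             geneC = c(9, 1, 1, 1),
             geneD = c(1, 1, 1, 1))

filter_expressed(toy, 3)   # only geneA is high in more than 3 samples
filter_expressed(toy, 2)   # geneA and geneB are high in more than 2 samples
```

With only two replicates per group, the choice of threshold decides whether a probe expressed in a single group survives the filter, so it is worth stating the rule explicitly.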
 895 | 
 896 | Here there is a small extra step to deal with outliers. This has to
 897 | be done carefully so the filtered data won’t be too biased. We calculate
 898 | ‘weights’ to define the reliability of each sample. The ‘arrayWeights’
 899 | function assigns a score to each sample, with a value of 1 implying
 900 | equal weight. Samples with a score below 1 are down-weighted; those with
 901 | a score above 1 are up-weighted.
 902 | 
 903 | ``` r
 904 | # coping with outliers
 905 | ## calculate relative array weights
 906 | aw <- arrayWeights(exprs(gse),design)
 907 | aw
 908 | ```
 909 | 
 910 |     ##         1         2         3         4         5         6         7         8 
 911 |     ## 0.9704842 0.9704842 0.8790788 0.8790788 0.9799805 0.9799805 0.8296341 0.8296341 
 912 |     ##         9        10        11        12 
 913 |     ## 1.1632620 1.1632620 1.2393733 1.2393733
 914 | 
 915 | Now that we have a design matrix, we need to estimate the coefficients.
 916 | For this design, we essentially average the replicate arrays for each
 917 | sample group. In addition, we calculate standard deviations for
 918 | each gene, and the average intensity for the genes across all
 919 | microarrays.
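
To see what "averaging the replicates" means once array weights are involved, here is a minimal base-R sketch for a single made-up gene. It uses `lm()` rather than limma, so it only illustrates the idea of a weighted group mean and is not a description of limma's internals; all values and names are assumptions.

```r
# One hypothetical gene measured on 4 arrays: 2 controls, 2 treated
y     <- c(4.0, 6.0, 5.0, 9.0)
group <- factor(c("ctrl", "ctrl", "trt", "trt"))

# Made-up array weights: the second control array is trusted 3x more
w <- c(1, 3, 1, 1)

# No-intercept fit: each coefficient is the weighted mean of its group
fit_w <- lm(y ~ 0 + group, weights = w)
coef(fit_w)

# The same numbers computed directly
weighted.mean(y[group == "ctrl"], w[group == "ctrl"])
weighted.mean(y[group == "trt"],  w[group == "trt"])
```

With equal weights inside a group the coefficient reduces to the plain mean of the replicates.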
 920 | 
 921 | We are ready to tell limma which pairwise contrasts we want to
 922 | make. For this experiment, we contrast each treatment (there are
 923 | two taxane drugs) with the control in each serum type, so there are
 924 | 4 contrasts to specify.
 925 | 
 926 | For the statistical comparisons, limma uses empirical Bayes moderation
 927 | of the gene-wise variances, which gives more stable inference with few
 928 | replicates. The eBayes function computes the moderated statistics. To
 929 | summarize the results, ‘topTable’ adjusts the p-values and returns the
 930 | top genes that meet the cutoffs you supply, while ‘decideTests’ calls
 931 | DEGs by adjusting the p-values and applying a logFC cutoff like topTable.
 932 | 
 933 | ``` r
 934 | ## Fitting the coefficients
 935 | fit <- lmFit(exprs(gse), design,
 936 |              weights = aw)
 937 | 
 938 | head(fit$coefficients)
 939 | ```
 940 | 
 941 |     ##          Cabazitaxeld Cabazitaxelf     Cond     Conf Docetaxeld Docetaxelf
 942 |     ## 16657436     4.604278     4.590902 4.539703 4.606668   4.569910   4.504447
 943 |     ## 16657440     5.073400     5.194166 5.274991 5.001295   5.230574   5.016408
 944 |     ## 16657450     6.606782     6.532054 6.571707 6.754964   6.854709   6.495389
 945 |     ## 16657469     5.318536     5.298899 5.358629 5.191262   5.398528   5.356415
 946 |     ## 16657476     5.410581     5.443410 5.358263 5.343908   5.516173   5.335881
 947 |     ## 16657480     4.481099     4.454008 4.487220 4.453781   4.418727   4.379055
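
A contrast is simply a difference between fitted group coefficients. As a minimal sketch, the first row of coefficients printed above (probe 16657436) can be combined by hand with a contrast matrix; the object names `coefs` and `contrast_mat` are made up for illustration, and the matrix is written out manually instead of using limma.

```r
# Coefficients for probe 16657436, copied from the output above
coefs <- c(Cabazitaxeld = 4.604278, Cabazitaxelf = 4.590902,
           Cond = 4.539703, Conf = 4.606668,
           Docetaxeld = 4.569910, Docetaxelf = 4.504447)

# Hand-written contrast matrix: +1 for the treatment, -1 for its control
contrast_mat <- cbind(
  "Docetaxeld - Cond"   = c( 0, 0, -1,  0, 1, 0),
  "Cabazitaxeld - Cond" = c( 1, 0, -1,  0, 0, 0),
  "Docetaxelf - Conf"   = c( 0, 0,  0, -1, 0, 1),
  "Cabazitaxelf - Conf" = c( 0, 1,  0, -1, 0, 0)
)
rownames(contrast_mat) <- names(coefs)

# Each contrast estimate is a weighted sum of the group coefficients
drop(coefs %*% contrast_mat)
```

Each value is the log2 fold change between one treatment group and its matching control for this probe.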
 948 | 
 949 | ``` r
 950 | ## Making comparisons between samples, can define multiple contrasts
 951 | contrasts <- makeContrasts(Docetaxeld - Cond, Cabazitaxeld - Cond, Docetaxelf - Conf, Cabazitaxelf - Conf, levels = design)
 952 | 
 953 | fit2 <- contrasts.fit(fit, contrasts)
 954 | fit2 <- eBayes(fit2)
 955 | 
 956 | 
 957 | topTable(fit2)
 958 | ```
 959 | 
 960 |     ##          Docetaxeld...Cond Cabazitaxeld...Cond Docetaxelf...Conf
 961 |     ## 16681891      -0.427866483          0.55288766       -0.02322651
 962 |     ## 16840609       0.573090433         -0.20083754       -0.21477778
 963 |     ## 17017165       0.116986540         -0.32582451       -0.44959204
 964 |     ## 16782010       0.178206104         -0.01891042       -0.29766696
 965 |     ## 17099705      -0.437302094          0.02759174       -0.16952565
 966 |     ## 16691877      -0.149430506         -0.18664917       -0.90866405
 967 |     ## 16959582      -0.006190599         -0.17371327        0.40620159
 968 |     ## 16970902      -0.416408859          0.01679171       -0.02017417
 969 |     ## 16936214      -0.112315413          0.12707501       -0.28387277
 970 |     ## 17009126      -0.645309985         -0.59167505       -0.39348344
 971 |     ##          Cabazitaxelf...Conf  AveExpr        F      P.Value  adj.P.Val
 972 |     ## 16681891          -0.1414628 5.625598 43.01843 2.992480e-06 0.07081404
 973 |     ## 16840609          -0.3278230 5.769203 18.87958 1.217387e-04 0.99909331
 974 |     ## 17017165          -0.3710638 5.244400 14.18263 4.059196e-04 0.99909331
 975 |     ## 16782010           0.2850105 4.440952 14.12466 4.128114e-04 0.99909331
 976 |     ## 17099705           0.2776297 6.750407 13.62541 4.783683e-04 0.99909331
 977 |     ## 16691877          -0.5849898 4.667038 11.52373 9.376105e-04 0.99909331
 978 |     ## 16959582           0.3098904 5.531461 11.26699 1.024647e-03 0.99909331
 979 |     ## 16970902          -0.1618155 5.237839 11.18970 1.052724e-03 0.99909331
 980 |     ## 16936214          -0.6271607 5.482977 11.17492 1.058196e-03 0.99909331
 981 |     ## 17009126          -0.3389764 4.940635 11.00507 1.123616e-03 0.99909331
 982 | 
 983 | ``` r
 984 | topTable1 <- topTable(fit2, coef=1)
 985 | topTable2 <- topTable(fit2, coef=2)
 986 | topTable3 <- topTable(fit2, coef=3)
 987 | topTable4 <- topTable(fit2, coef=4)
 988 | 
 989 | # If we want to know how many genes are differentially expressed overall, we can use the decideTests function.
 990 | summary(decideTests(fit2))
 991 | ```
 992 | 
 993 |     ##        Docetaxeld - Cond Cabazitaxeld - Cond Docetaxelf - Conf
 994 |     ## Down                   0                   0                 0
 995 |     ## NotSig             23664               23664             23664
 996 |     ## Up                     0                   0                 0
 997 |     ##        Cabazitaxelf - Conf
 998 |     ## Down                     0
 999 |     ## NotSig               23664
1000 |     ## Up                       0
1001 | 
1002 | ``` r
1003 | table(decideTests(fit2))
1004 | ```
1005 | 
1006 |     ## 
1007 |     ##     0 
1008 |     ## 94656
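
No probe passes because the adjusted p-values are far larger than the raw ones. In the topTable output above, the top probe's adj.P.Val (0.0708) is consistent with its raw p-value (2.99e-06) multiplied by the 23664 probes tested, which is what the Benjamini-Hochberg (BH) adjustment gives for the smallest p-value. A minimal sketch of that adjustment on made-up p-values, comparing base R's `p.adjust` with a manual calculation:

```r
# Made-up raw p-values for five hypothetical probes
p_raw <- c(0.001, 0.01, 0.03, 0.04, 0.2)

# Benjamini-Hochberg adjustment with base R
p_bh <- p.adjust(p_raw, method = "BH")

# The same adjustment by hand: scale each p-value by n / rank, then
# enforce monotonicity starting from the largest p-value
n      <- length(p_raw)
ord    <- order(p_raw, decreasing = TRUE)
manual <- numeric(n)
manual[ord] <- pmin(1, cummin(p_raw[ord] * n / (n:1)))

cbind(p_raw, p_bh, manual)
```

With tens of thousands of probes and two replicates per group, a raw p-value has to be very small to survive this correction.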
1009 | 
1010 | # Further visualization with gene annotation
1011 | 
1012 | Now we want to know the gene name associated with each probe ID. The
1013 | annotation data can be retrieved with the ‘fData’ function. Let’s select
1014 | the ID and GB\_ACC (the GenBank accession ID) columns and add them to fit2.
1015 | 
1016 | The volcano plot is a common way of visualising the results
1017 | of a DE analysis. The x axis shows the log-fold change and the y axis is
1018 | some measure of statistical significance, which in this case is the
1019 | log-odds, or “B” statistic. We can also colour the genes that pass a
1020 | p-value cutoff (below 0.05) and an absolute log-fold-change cutoff (above 1).
1021 | 
1022 | ``` r
1023 | anno <- fData(gse)
1024 | head(anno)
1025 | ```
1026 | 
1027 |     ##                ID RANGE_STRAND RANGE_START RANGE_END total_probes    GB_ACC
1028 |     ## 16657436 16657436            +       12190     13639           25 NR_046018
1029 |     ## 16657440 16657440            +       29554     31109           28          
1030 |     ## 16657450 16657450            +      317811    328581           36 NR_024368
1031 |     ## 16657469 16657469            +      329790    342507           27          
1032 |     ## 16657476 16657476            +      459656    461954           27 NR_029406
1033 |     ## 16657480 16657480            +      523009    532878           12          
1034 |     ##                     SPOT_ID     RANGE_GB
1035 |     ## 16657436   chr1:12190-13639 NC_000001.10
1036 |     ## 16657440   chr1:29554-31109 NC_000001.10
1037 |     ## 16657450 chr1:317811-328581 NC_000001.10
1038 |     ## 16657469 chr1:329790-342507 NC_000001.10
1039 |     ## 16657476 chr1:459656-461954 NC_000001.10
1040 |     ## 16657480 chr1:523009-532878 NC_000001.10
1041 | 
1042 | ``` r
1043 | anno <- select(anno,ID,GB_ACC)
1044 | fit2$genes <- anno
1045 | 
1046 | topTable(fit2)
1047 | ```
1048 | 
1049 |     ##                ID       GB_ACC Docetaxeld...Cond Cabazitaxeld...Cond
1050 |     ## 16681891 16681891 NM_001013692      -0.427866483          0.55288766
1051 |     ## 16840609 16840609                    0.573090433         -0.20083754
1052 |     ## 17017165 17017165                    0.116986540         -0.32582451
1053 |     ## 16782010 16782010                    0.178206104         -0.01891042
1054 |     ## 17099705 17099705    NR_039696      -0.437302094          0.02759174
1055 |     ## 16691877 16691877                   -0.149430506         -0.18664917
1056 |     ## 16959582 16959582                   -0.006190599         -0.17371327
1057 |     ## 16970902 16970902                   -0.416408859          0.01679171
1058 |     ## 16936214 16936214    NR_037440      -0.112315413          0.12707501
1059 |     ## 17009126 17009126                   -0.645309985         -0.59167505
1060 |     ##          Docetaxelf...Conf Cabazitaxelf...Conf  AveExpr        F      P.Value
1061 |     ## 16681891       -0.02322651          -0.1414628 5.625598 43.01843 2.992480e-06
1062 |     ## 16840609       -0.21477778          -0.3278230 5.769203 18.87958 1.217387e-04
1063 |     ## 17017165       -0.44959204          -0.3710638 5.244400 14.18263 4.059196e-04
1064 |     ## 16782010       -0.29766696           0.2850105 4.440952 14.12466 4.128114e-04
1065 |     ## 17099705       -0.16952565           0.2776297 6.750407 13.62541 4.783683e-04
1066 |     ## 16691877       -0.90866405          -0.5849898 4.667038 11.52373 9.376105e-04
1067 |     ## 16959582        0.40620159           0.3098904 5.531461 11.26699 1.024647e-03
1068 |     ## 16970902       -0.02017417          -0.1618155 5.237839 11.18970 1.052724e-03
1069 |     ## 16936214       -0.28387277          -0.6271607 5.482977 11.17492 1.058196e-03
1070 |     ## 17009126       -0.39348344          -0.3389764 4.940635 11.00507 1.123616e-03
1071 |     ##           adj.P.Val
1072 |     ## 16681891 0.07081404
1073 |     ## 16840609 0.99909331
1074 |     ## 17017165 0.99909331
1075 |     ## 16782010 0.99909331
1076 |     ## 17099705 0.99909331
1077 |     ## 16691877 0.99909331
1078 |     ## 16959582 0.99909331
1079 |     ## 16970902 0.99909331
1080 |     ## 16936214 0.99909331
1081 |     ## 17009126 0.99909331
1082 | 
1083 | ``` r
1084 | ## Create volcano plot
1085 | full_results1 <- topTable(fit2, coef=1, number=Inf)
1086 | library(ggplot2)
1087 | ggplot(full_results1,aes(x = logFC, y=B)) + geom_point()
1088 | ```
1089 | 
1090 | ![](GSE_analysis_microarray_files/figure-gfm/unnamed-chunk-9-1.png)<!-- -->
1091 | 
1092 | ``` r
1093 | ## change according to your needs
1094 | p_cutoff <- 0.05
1095 | fc_cutoff <- 1
1096 | 
1097 | 
1098 | full_results1 %>% 
1099 |   mutate(Significant = P.Value < p_cutoff & abs(logFC) > fc_cutoff) %>% 
1100 |   ggplot(aes(x = logFC, y = B, col=Significant)) + geom_point()
1101 | ```
1102 | 
1103 | ![](GSE_analysis_microarray_files/figure-gfm/unnamed-chunk-9-2.png)<!-- -->
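
For a probe to be highlighted it should pass both cutoffs at once, which means combining the two conditions with `&` into a single logical column. A minimal base-R sketch on a made-up results table (the data frame `toy_results` and its values are assumptions for illustration):

```r
# Made-up results for five hypothetical probes
toy_results <- data.frame(
  logFC   = c(1.8, -1.4, 0.3, -0.2, 2.1),
  P.Value = c(0.001, 0.03, 0.004, 0.6, 0.2)
)

p_cutoff  <- 0.05
fc_cutoff <- 1

# TRUE only when the p-value AND the absolute log fold change pass
toy_results$Significant <- toy_results$P.Value < p_cutoff &
  abs(toy_results$logFC) > fc_cutoff

toy_results
table(toy_results$Significant)
```

The third probe has a small p-value but a small fold change, and the fifth has a large fold change but a large p-value, so neither is flagged.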
1104 | 
1105 | # Further visualization of selected gene
1106 | 
1107 | At this point, the structure of GSE data should be quite clear. It
1108 | has the experiment (sample) data, pData; the expression data, exprs; and
1109 | the annotation data, fData. We have also learned how to check the
1110 | expression data, normalize them, and perform differential expression
1111 | analysis.
1112 | 
1113 | Now, with the differential expression tables, there are several
1114 | downstream analyses that we can continue with, such as exporting a full
1115 | table of DE genes, generating a heatmap for selected genes, getting the
1116 | gene list for a particular pathway, or survival analysis (only possible
1117 | for clinical data).
1118 | 
1119 | Here, I just want to look at the fold change of a selected gene and
1120 | whether it is significantly differentially expressed or not.
1121 | 
1122 | ``` r
1123 | ## Get the results for particular gene of interest
1124 | #GB_ACC for Nkx3-1 is NM_001256339 or NM_006167
1125 | ##no NM_001256339 in this data
1126 | full_results2 <- topTable(fit2, coef=2, number=Inf)
1127 | full_results3 <- topTable(fit2, coef=3, number=Inf)
1128 | full_results4 <- topTable(fit2, coef=4, number=Inf)
1129 | filter(full_results1, GB_ACC == "NM_006167")
1130 | ```
1131 | 
1132 |     ##                ID    GB_ACC      logFC  AveExpr         t    P.Value adj.P.Val
1133 |     ## 17075536 17075536 NM_006167 -0.2379718 8.055042 -3.056052 0.01219317 0.9819604
1134 |     ##                  B
1135 |     ## 17075536 -3.374675
1136 | 
1137 | ``` r
1138 | filter(full_results2, GB_ACC == "NM_006167")
1139 | ```
1140 | 
1141 |     ##                ID    GB_ACC       logFC  AveExpr         t   P.Value adj.P.Val
1142 |     ## 17075536 17075536 NM_006167 -0.09964001 8.055042 -1.244539 0.2418169 0.9999406
1143 |     ##                  B
1144 |     ## 17075536 -4.561219
1145 | 
1146 | ``` r
1147 | filter(full_results3, GB_ACC == "NM_006167")
1148 | ```
1149 | 
1150 |     ##                ID    GB_ACC      logFC  AveExpr         t   P.Value adj.P.Val
1151 |     ## 17075536 17075536 NM_006167 0.02902121 8.055042 0.3762533 0.7146269 0.9999674
1152 |     ##                  B
1153 |     ## 17075536 -4.926393
1154 | 
1155 | ``` r
1156 | filter(full_results4, GB_ACC == "NM_006167")
1157 | ```
1158 | 
1159 |     ##                ID    GB_ACC      logFC  AveExpr         t   P.Value adj.P.Val
1160 |     ## 17075536 17075536 NM_006167 0.01019426 8.055042 0.1304658 0.8987979 0.9998834
1161 |     ##                 B
1162 |     ## 17075536 -4.95523
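
The four lookups above can also be collected into one table. A minimal sketch using base R on made-up result tables; the helper `lookup_gene`, the list `toy_tables`, the accession labels and all numbers are hypothetical and only illustrate the pattern:

```r
# Made-up per-contrast result tables (stand-ins for full_results1..4)
toy_tables <- list(
  "Docetaxeld - Cond" = data.frame(GB_ACC = c("ACC_A", "ACC_B"),
                                   logFC  = c(0.50, -0.20),
                                   stringsAsFactors = FALSE),
  "Cabazitaxeld - Cond" = data.frame(GB_ACC = c("ACC_A", "ACC_B"),
                                     logFC  = c(0.10, -0.05),
                                     stringsAsFactors = FALSE)
)

# Hypothetical helper: pull one accession from every table and stack the rows
lookup_gene <- function(tables, acc) {
  hits <- lapply(names(tables), function(nm) {
    tab <- tables[[nm]]
    hit <- tab[tab$GB_ACC == acc, , drop = FALSE]
    if (nrow(hit) == 0) return(NULL)     # accession absent from this table
    cbind(contrast = nm, hit, stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

gene_summary <- lookup_gene(toy_tables, "ACC_B")
gene_summary
```

Applied to a named list of the real tables, the same helper would return one row per contrast for the accession of interest.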
1163 | 
1164 | That’s all for the walk-through. Thanks for reading; I hope you have
1165 | learned something new here.
1166 | 
1167 | # Acknowledgement
1168 | 
1169 | Many thanks to the following tutorials made publicly available:
1170 | 
1171 | 1.  Introduction to microarray analysis GSE15947, by Department of
1172 |     Statistics, Purdue University
1173 |     <https://www.stat.purdue.edu/bigtap/online/docs/Introduction_to_Microarray_Analysis_GSE15947.html>
1174 | 
1175 | 2.  Mark Dunning, 2020, GEO tutorial, by Sheffield Bioinformatics Core
1176 |     <https://sbc.shef.ac.uk/geo_tutorial/tutorial.nb.html#Further_processing_and_visualisation_of_DE_results>
1177 | 


--------------------------------------------------------------------------------