├── Mfuzz_RNAseq.R
└── README.md


/Mfuzz_RNAseq.R:
--------------------------------------------------------------------------------
  1 | #Author : Amandine Velt (amandine.velt@inra.fr)
  2 | #Date : 18/11/2016
  3 | 
  4 | #################################################################################################################################################################
  5 | # At the beginning, Mfuzz was developed to perform clustering of gene expression from microarray data.
  6 | # To adapt this method to RNA-seq data, the authors suggest some additional preprocessing: for instance, starting from FPKMs (normalization by gene length)
  7 | # and excluding genes which do not show expression (i.e. with FPKM equal to zero).
  8 | #
  9 | # This script takes as input a directory path containing all (and only) the RNA-seq raw count data tables (eg a directory with one htseq-count file per sample)
 10 | # and performs the DESeq normalization method (normalization by library size) and then calculates the RPKM. After gene length normalization, this script performs 
 11 | # the clustering of gene expression time-series RNA-seq data with Mfuzz.
 12 | #
 13 | # Usage :
 14 | # Complete command : /usr/bin/Rscript Mfuzz_RNAseq.R -f count_files_folder -a annotation -b gene_name_attribute -t time -n nb_clusters -m membership_cutoff -s 0 -e 0.25 -r "mean" -o output_directory
 15 | # Minimal command : /usr/bin/Rscript Mfuzz_RNAseq.R -f count_files_folder -a annotation -t time
 16 | #
 17 | # Arguments description :
 18 | # count_files_folder -> directory containing all the raw count data tables (one per sample)
 19 | # annotation -> a gtf or gff file with transcript/gene information used to calculate gene lengths (sum of exon lengths, with overlapping exons taken into account)
 20 | # gene_name_attribute -> the name of the attribute in the gtf referring to the gene information
 21 | # time -> the time value of each file, given in the same order as the files in the folder.
 22 | #   if several files correspond to the same time (replicates), give them the same time value; the script then averages the normalized counts of all the
 23 | #   samples of that time
 24 | # nb_clusters -> number of clusters to generate with Mfuzz (empirical choice)
 25 | # membership_cutoff -> the membership cut-off to use with Mfuzz -> see the Mfuzz paper : http://www.bioinformation.net/002/000200022007.pdf
 26 | #                       you can give one value, eg "0.7" or several values separated by ",", eg '0.5,0.7'
 27 | # output -> directory where the results are stored
 28 | #
 29 | # Examples of arguments to give :
 30 | # count_files_folder="/home/user/count_files_folder"
 31 | # annotation="HS.Genes.v2.gff"
 32 | # gene_name_attribute="gene"
 33 | # time="time1,time1,time1,time2,time2,time2,time3"
 34 | # nb_clusters = 4
 35 | # membership_cutoff = '0.5,0.7'
 36 | # min_std_threshold= 0
 37 | # exclude_thres= 0.25
 38 | # replacement_mode="mean"
 39 | # output="/home/user/cluster_output"
 40 | #################################################################################################################################################################
 41 | 
 42 | # libraries dependencies
 43 | suppressMessages(library("optparse"))
 44 | suppressMessages(library("tools"))
 45 | suppressMessages(library("Mfuzz"))
 46 | suppressMessages(library("GenomicFeatures"))
 47 | suppressMessages(library("DESeq"))
 48 | suppressMessages(library("edgeR"))
 49 | 
 50 | # options of the script
 51 | option_list = list(
 52 |   make_option(c("-f", "--folder"), type="character", default=NULL, 
 53 |     help="[REQUIRED] Directory containing all the raw count data tables (one per sample)", metavar="character"),
 54 |   make_option(c("-a", "--annotation"), type="character", default=NULL, 
 55 |     help="[REQUIRED] A gtf or gff file with transcript/gene information used to calculate gene lengths (sum of exon lengths, with overlapping exons taken into account)", metavar="character"),
 56 |   make_option(c("-t", "--time"), type="character", default=NULL, 
 57 |     help="[REQUIRED] The time value of each file, given in the same order as the files in the folder. 
 58 |             Give a list of type 'time1,time1,time1,time2,time2,time2,time3'. 
 59 |             If several files correspond to the same time (replicates), give them the same time value; the script then averages the normalized counts of all the samples of that time", metavar="character"),
 60 |   make_option(c("-b", "--gene_attribute"), type="character", default="gene", 
 61 |     help="The name of the attribute in the gtf referring to the gene information [default= %default]", metavar="character"),
 62 |   make_option(c("-n", "--nb_clusters"), type="integer", default=as.numeric(4), 
 63 |     help="Number of clusters to generate with Mfuzz (empirical choice) [default= %default]", metavar="integer"),
 64 |   make_option(c("-m", "--membership_cutoff"), type="character", default="0.7", 
 65 |     help="The membership cut-off used to generate gene lists for each cluster with Mfuzz.
 66 |           By default, genes having a membership value of 0.7 for the cluster are included in the list for this cluster.
 67 |           You can give one value, eg '0.7' or several values separated by ',', eg '0.5,0.7'.
 68 |           See the Mfuzz paper : http://www.bioinformation.net/002/000200022007.pdf [default= %default]", metavar="character"),
 69 |   make_option(c("-s", "--min_std"), type="double", default=as.numeric(0), 
 70 |     help="Threshold for minimum standard deviation, used by Mfuzz. If the standard deviation of a gene's expression is smaller than min.std the corresponding gene will be excluded.
 71 |           Default : no filtering. [default= %default]", metavar="double"),
 72 |   make_option(c("-e", "--exclude_thres"), type="double", default=as.numeric(0.25), 
 73 |     help="Exclude genes with more than n% of the measurements missing [default= %default] -> by default, genes with 25% of the measurements missing are excluded.", metavar="double"),
 74 |   make_option(c("-r", "--replacement_mode"), type="character", default="mean", 
 75 |     help="Method for replacement of missing values. Fuzzy c-means, like many other clustering algorithms, does not allow for missing values.
 76 |             Thus, by default, we replace remaining missing values by the average expression value of the corresponding gene. [default= %default]
 77 |             Other available methods : median, knn, knnw", metavar="character"),
 78 |   make_option(c("-o", "--output"), type="character", default=NULL,
 79 |     help="The directory where the results are stored. By default, the current directory. [default= %default]", metavar="character")
 80 | ); 
 81 | 
 82 | # parsing of the arguments
 83 | opt_parser = OptionParser(option_list=option_list);
 84 | opt = parse_args(opt_parser);
 85 | 
 86 | # test the three required arguments and exit if one of them is not given
 87 | if (is.null(opt$folder)){
 88 |   print_help(opt_parser)
 89 |   stop("The input folder argument must be supplied (-f).\n", call.=FALSE)
 90 | }
 91 | if (is.null(opt$annotation)){
 92 |   print_help(opt_parser)
 93 |   stop("The GFF/GTF annotation file argument must be supplied (-a).\n", call.=FALSE)
 94 | }
 95 | if (is.null(opt$time)){
 96 |   print_help(opt_parser)
 97 |   stop("The time values vector argument must be supplied (-t).\n", call.=FALSE)
 98 | }
 99 | 
100 | # variable assignment
101 | count_files_folder=opt$folder
102 | annotation=opt$annotation
103 | gene_name_attribute=opt$gene_attribute
104 | time=as.vector(strsplit(opt$time,","))[[1]]
105 | nb_clusters=opt$nb_clusters
106 | membership_cutoff=strsplit(as.character(opt$membership_cutoff),",")[[1]]  # as.character() guards against a numeric default
107 | min_std_threshold=opt$min_std
108 | exclude_thres=opt$exclude_thres
109 | replacement_mode=opt$replacement_mode
110 | output=opt$output
111 | 
112 | # test of the output, if empty, give the current directory
113 | if (is.null(output)){
114 |   output=getwd()
115 | }
116 | # create the output directory (and any missing parent directories) if it doesn't exist
117 | dir.create(output, showWarnings = FALSE, recursive = TRUE)
118 | 
119 | ###########################################################################################################################
120 | # Normalization part
121 | ###########################################################################################################################
122 | 
123 | # determine the extension of the annotation file (may be gtf or gff)
124 | annotation_ext=file_ext(annotation)
125 | # create the object with all count files
126 | files=list.files(count_files_folder)
127 | raw=readDGE(files, path=count_files_folder, group=c(1:length(files)), columns=c(1,2))
128 | # create transcripts database from gtf or gff
129 | txdb=makeTxDbFromGFF(annotation,format=annotation_ext)
130 | # then collect the exons per gene id
131 | exons.list.per.gene=exonsBy(txdb,by=gene_name_attribute)
132 | # then for each gene, reduce all the exons to a set of non-overlapping exons, calculate their lengths (widths) and sum them
133 | exonic.gene.sizes=as.data.frame(sum(width(reduce(exons.list.per.gene))))
134 | colnames(exonic.gene.sizes)="gene_length_bp"
135 | # our raw data table containing all the samples
136 | datafile=raw$counts
137 | # remove all the genes with a 0 count in all samples
138 | data = datafile[apply(datafile,1,sum)!=0,]
139 | # determine number of studied samples
140 | nblib= dim(data)[2]
141 | # create a dummy condition vector (one level per sample) for DESeq normalization
142 | conds = factor(1:nblib)
143 | # normalize raw read counts by library size with DESeq method
144 | cds = newCountDataSet(data, conds)
145 | cds = estimateSizeFactors(cds)
146 | datanorm = t(t(data)/sizeFactors(cds))
147 | colnames(datanorm)=paste(colnames(data),"normalized_by_DESeq", sep="_")
148 | # merge the raw read counts and the read counts normalized by DESeq
149 | alldata = merge(data, datanorm, by="row.names", all=T)
150 | alldata_tmp=alldata[,-1]
151 | rownames(alldata_tmp)=alldata[,1]
152 | alldata=alldata_tmp
153 | # merge the raw and normalized read counts with the genes length (in bp)
154 | alldata_tmp = merge(alldata, exonic.gene.sizes, by="row.names", all.x=T)
155 | alldata=alldata_tmp[,-1]
156 | rownames(alldata)=alldata_tmp[,1]
157 | # indices of the first and last normalized read count columns
158 | start=length(files)+1
159 | end=length(files)*2
160 | # calculation of the RPKM
161 | data_norm=merge(datanorm,exonic.gene.sizes, by="row.names", all.x=T)
162 | data_norm_tmp=data_norm[,-1]
163 | rownames(data_norm_tmp)=data_norm[,1]
164 | data_norm=data_norm_tmp
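165 | # use every column except the last one (gene_length_bp); the parentheses make the intended range 1:(n-1) explicit
166 | rpkm=rpkm(data_norm[,1:(dim(data_norm)[2]-1)],gene.length=data_norm$"gene_length_bp", normalized.lib.sizes=FALSE, log=FALSE)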
166 | colnames(rpkm)=paste(colnames(data),"normalized_by_DESeq_and_divided_by_gene_length", sep="_")
167 | # merge the RPKN values with the table containing raw and normalized read counts and gene length
168 | alldata_tmp = merge(alldata, rpkm, by="row.names", all.x=T)
169 | alldata=alldata_tmp[,-1]
170 | rownames(alldata)=alldata_tmp[,1]
171 | # write this table containing the raw read counts and the different normalizations
172 | write.table(alldata, paste(output,"normalized_counts_all_genes.txt",sep="/"), sep="\t", quote=F, row.names=T, dec=".")
173 | 
174 | ###########################################################################################################################
175 | # Mfuzz part
176 | ###########################################################################################################################
177 | 
178 | # determine the first RPKN column in the alldata object -> RPKN are used by Mfuzz for genes clustering
179 | first_rpkm_column=dim(alldata)[2]-length(files)+1
180 | # here we create a matrix containing all the RPKN columns, used by Mfuzz for the clustering
181 | exprs=as.matrix(alldata[,first_rpkm_column:dim(alldata)[2]])
182 | # and for each time value containing replicates, we calculate the RPKN count means
183 | # if there are no replicates, we keep the initial RPKN
184 | count=1
185 | for ( i in unique(time) ){
186 |   if ( dim(as.data.frame(exprs[,which(time==i)]))[2] == 1 ){
187 |     mean_rpkm=data.frame(exprs[,which(time==i)])
188 |   } else {
189 |     mean_rpkm=data.frame(rowMeans(exprs[,which(time==i)]))
190 |   }
191 |   colnames(mean_rpkm)=i
192 |   if (count == 1){
193 |     mean_rpkm_ok=mean_rpkm
194 |   } else {
195 |     mean_rpkm_ok=merge(mean_rpkm_ok,mean_rpkm,by="row.names")
196 |     rownames(mean_rpkm_ok)=mean_rpkm_ok[,1]
197 |     mean_rpkm_ok=mean_rpkm_ok[,-1]
198 |   }
199 |   count=count+1
200 | }
201 | # here we have a RPKN matrix containing one column per time value (and not one column per sample)
202 | exprs_with_time=as.matrix(mean_rpkm_ok)
203 | 
204 | # we create the Mfuzz object (ExpressionSet)
205 | exprSet=ExpressionSet(assayData=exprs_with_time)
206 | 
207 | #--------------------------------------------------------------------------------------------------------------------------
208 | # As a first step, we exclude genes with more than exclude_thres (25% by default) of the measurements missing
209 | # -> missing values must be indicated by NA in the expression matrix
210 | exprSet.r=filter.NA(exprSet, thres=exclude_thres)
211 | #--------------------------------------------------------------------------------------------------------------------------
212 | 
213 | #--------------------------------------------------------------------------------------------------------------------------
214 | # Fuzzy c-means, like many other clustering algorithms, does not allow for missing values.
215 | # Thus, we replace remaining missing values by the average expression value of the
216 | # corresponding gene.
217 | # Methods for replacement of missing values. Missing values should be indicated by NA in the expression matrix.
218 | # Available modes:
219 |   # mean - missing values will be replaced by the mean expression value of the gene,
220 |   # median - missing values will be replaced by the median expression value of the gene,
221 |   # knn - missing values will be replaced by averaging over the corresponding expression values of the k-nearest neighbours,
222 |   # knnw - same replacement method as knn, but the expression values averaged are weighted by the distance to the corresponding neighbour
223 | exprSet.f=fill.NA(exprSet.r,mode=replacement_mode)
224 | #--------------------------------------------------------------------------------------------------------------------------
225 | 
226 | #--------------------------------------------------------------------------------------------------------------------------
227 | # As soft clustering is noise robust, pre-filtering can usually be avoided. 
228 | # However, if the number of genes with small expression changes is large, such pre-filtering may be necessary to reduce noise. 
229 | # This function can be used to exclude genes with low standard deviation.
230 | # min.std : threshold for minimum standard deviation. 
231 | # If the standard deviation of a gene's expression is smaller than min.std the corresponding gene will be excluded.
232 | tmp=filter.std(exprSet.f,min.std=min_std_threshold, visu=FALSE)
233 | #--------------------------------------------------------------------------------------------------------------------------
234 | 
235 | #--------------------------------------------------------------------------------------------------------------------------
236 | # Since the clustering is performed in Euclidean space, the expression values of genes are
237 | # standardised to have a mean value of zero and a standard deviation of one. This step ensures
238 | # that vectors of genes with similar changes in expression are close in Euclidean space.
239 | # Importantly, Mfuzz assumes that the given expression data are fully preprocessed, including any data normalisation.
240 | # The function standardise does not replace the normalisation step (eg RPKN normalization).
241 | exprSet.s=standardise(tmp)
242 | #--------------------------------------------------------------------------------------------------------------------------
243 | 
244 | #--------------------------------------------------------------------------------------------------------------------------
245 | # clustering
246 | m1=mestimate(exprSet.s)
247 | cl=mfuzz(exprSet.s,c=nb_clusters,m=m1)
248 | #--------------------------------------------------------------------------------------------------------------------------
249 | 
250 | for (membership in membership_cutoff){
251 |   membership=as.numeric(membership)
252 |   # create one output folder per membership
253 |   dir=paste(output,paste("cluster_with_membership",membership, sep="_"),sep="/")
254 |   dir.create(dir, showWarnings = FALSE)
255 |   #--------------------------------------------------------------------------------------------------------------------------
256 |   # membership cut-off part and plot clusters
257 |   pdf(paste(dir,paste(paste("clusters_Mfuzz_membership_equals_",membership,sep=""),".pdf",sep=""), sep="/"))
258 |   mfuzz.plot2(exprSet.s,cl=cl,time.labels=unique(time),min.mem=membership, colo="fancy", x11=FALSE)
259 |   dev.off()
260 |   #--------------------------------------------------------------------------------------------------------------------------
261 | 
262 |   #--------------------------------------------------------------------------------------------------------------------------
263 |   # generates one genes list per cluster
264 |   acore.list=acore(exprSet.s,cl=cl,min.acore=membership)
265 |   print("----")
266 |   print(paste("Membership",membership,sep=" : "))
267 |   for (cluster in 1:nb_clusters){
268 |     print(paste(paste("Number of genes in cluster", cluster, sep=" "),dim(acore.list[[cluster]])[1], sep=" : "))
269 |     cluster_table=merge(alldata,acore.list[[cluster]][2], by="row.names", all.y=TRUE)
270 |     # paste0 avoids the stray space that paste() would insert before ".txt"
271 |     write.table(cluster_table,paste(dir,paste0("list_of_genes_in_cluster_",cluster,".txt"),sep="/"), sep="\t",row.names=F, dec=".")
271 |   }
272 |   print("----")
273 |   #--------------------------------------------------------------------------------------------------------------------------
274 | }
275 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Mfuzz_RNAseq.R
 2 | An R script to perform clustering of gene expression time-series RNA-seq data with Mfuzz.
 3 | 
 4 | Required R libraries :
 5 | optparse, tools, Mfuzz, GenomicFeatures, DESeq, edgeR
 6 | 
 7 | Mfuzz webpage : http://mfuzz.sysbiolab.eu/
 8 | Mfuzz paper : http://w3.ualg.pt/%7Emfutschik/publications/bioinformation.pdf
 9 | 
10 | Mfuzz_RNAseq.R takes as input a set of RNA-seq count tables, one per sample, for example from HTSeq-count. All the RNA-seq count tables must be contained in the same folder, which is given as input to the script. 
11 | 
12 | For example, a folder containing four count data files : Sample1.txt,Sample2.txt,Sample3.txt,Sample4.txt
13 | 
14 | Sample1.txt contains the following data, without header :
15 | 
16 | | GeneID1       | S1Count1        |
17 | | --- | --- |
18 | | GeneID2       | S1Count2        |
19 | | GeneID3       | S1Count3        |
20 | | GeneID4       | S1Count4        |
21 | 
22 | Mfuzz_RNAseq.R reads all the files and generates :
23 | 
24 | | GeneID        | Sample1         | Sample2         | Sample3         | Sample4         |
25 | | --- | --- | --- | --- | --- |
26 | | GeneID2       | S1Count2        | S2Count2        | S3Count2        | S4Count2        |
27 | | GeneID3       | S1Count3        | S2Count3        | S3Count3        | S4Count3        |
28 | | GeneID4       | S1Count4        | S2Count4        | S3Count4        | S4Count4        |
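
The merge shown above can be sketched in base R as follows. This is only an illustration under stated assumptions: the folder, file names and counts are hypothetical, and the script itself relies on edgeR's `readDGE` rather than on this code.

```r
# Hypothetical two-sample example written to a temporary folder.
folder <- file.path(tempdir(), "count_files_demo")
dir.create(folder, showWarnings = FALSE)
writeLines(c("GeneID1\t10", "GeneID2\t0", "GeneID3\t5"), file.path(folder, "Sample1.txt"))
writeLines(c("GeneID1\t20", "GeneID2\t3", "GeneID3\t0"), file.path(folder, "Sample2.txt"))

files <- list.files(folder)

# Read each two-column file (gene ID, count) as a vector named by gene ID.
per_sample <- lapply(files, function(f) {
  tab <- read.delim(file.path(folder, f), header = FALSE, stringsAsFactors = FALSE)
  setNames(tab[[2]], tab[[1]])
})

# Align all samples on the union of gene IDs: one row per gene, one column per sample.
genes <- sort(unique(unlist(lapply(per_sample, names))))
counts <- sapply(per_sample, function(v) v[genes])
dimnames(counts) <- list(genes, tools::file_path_sans_ext(files))
counts[is.na(counts)] <- 0  # a gene absent from one file is counted as 0

print(counts)
```

Because `list.files` returns the files in alphabetical order, the columns follow that order, which is also the order the time values are expected in.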
29 | 
30 | From this table, Mfuzz_RNAseq.R performs a complete RNA-seq data normalization and then uses the Mfuzz package to perform a soft clustering of gene expression time-series data.
31 | 
32 | Normalization steps : From the input count tables, the Mfuzz_RNAseq.R script performs a library size normalization with the DESeq method and then adjusts these normalized data for gene length (normalized data / gene length). These normalization steps are carried out to make all the samples comparable, which is required by the Mfuzz package.
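
The two normalization steps can be illustrated with a minimal base R sketch. It assumes the median-of-ratios definition of DESeq size factors and uses hypothetical counts and gene lengths. It follows the description above (normalized data / gene length), whereas the script itself delegates the last step to edgeR's `rpkm()`, whose values may differ by a per-sample scaling.

```r
# Hypothetical raw counts: 3 genes x 2 samples, sample s2 sequenced twice as deeply as s1.
counts <- matrix(c(10, 20,
                   100, 200,
                   50, 100),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene_length_bp <- c(g1 = 1000, g2 = 2000, g3 = 500)  # hypothetical exonic lengths

# Median-of-ratios size factors: for each sample, the median over genes of
# count / geometric mean of that gene across samples (genes with a zero count are skipped).
log_geo_means <- rowMeans(log(counts))
size_factors <- apply(counts, 2, function(col) {
  exp(median((log(col) - log_geo_means)[is.finite(log_geo_means)]))
})

# Library size normalization, then adjustment for gene length (per kilobase).
norm_counts <- t(t(counts) / size_factors)
per_kb <- norm_counts / (gene_length_bp[rownames(norm_counts)] / 1000)

print(size_factors)
print(per_kb)
```

In this toy case the two samples differ only by sequencing depth, so their normalized columns become identical, which is the comparability the clustering step relies on.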
33 | 
34 | Soft clustering steps : With these last normalized data (called RPKN data), the Mfuzz_RNAseq.R script performs a gene clustering analysis with the Mfuzz package, generating clusters and the associated gene lists.
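
To give an idea of what soft clustering and the membership cut-off mean, here is a minimal base R sketch. It is not Mfuzz code: the expression values and the two cluster centres are hypothetical and fixed by hand, whereas Mfuzz estimates the centres iteratively. Only the standard fuzzy c-means membership formula is shown.

```r
# Hypothetical expression matrix: 4 genes x 3 time points.
expr <- matrix(c(1, 2, 3,
                 2, 4, 6,
                 3, 2, 1,
                 6, 4, 2),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("gene", 1:4), c("t1", "t2", "t3")))

# Standardise each gene to mean 0 and standard deviation 1.
expr_s <- t(scale(t(expr)))

# Two hypothetical cluster centres: an increasing and a decreasing profile.
centers <- rbind(up = c(-0.8, 0, 0.8), down = c(0.8, 0, -0.8))
m <- 2  # fuzzifier (assumed value; the script estimates it with mestimate)

# Squared Euclidean distance of every gene to every centre.
d2 <- sapply(rownames(centers), function(k) {
  rowSums(sweep(expr_s, 2, centers[k, ])^2)
})

# Fuzzy c-means membership: inverse distances, normalised so that each gene sums to 1.
w <- (1 / d2)^(1 / (m - 1))
membership <- w / rowSums(w)

# Genes reaching the cut-off form the gene list of each cluster.
cutoff <- 0.7
core <- lapply(setNames(colnames(membership), colnames(membership)), function(k) {
  rownames(membership)[membership[, k] >= cutoff]
})

print(round(membership, 3))
print(core)
```

Each gene receives a membership value for every cluster, and raising the cut-off keeps only the genes that follow a cluster profile most closely.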
35 | 
36 | This script has three main inputs : 
37 | - the argument "--folder" or "-f", which is the directory containing all the RNA-seq count tables (and only these files).
38 |   Mfuzz_RNAseq.R will read and merge all these tables and will perform the normalization steps.
39 | - the argument "--annotation" or "-a", which is the path to a gene/transcript annotation file (gff or gtf format) used 
40 |   to calculate the gene lengths (sum of the exon lengths, with overlapping exons taken into account). These lengths are used 
41 |   during the data normalization by gene length.
42 | - the argument "--time" or "-t", which gives the time value of each file, in the same order as the files in 
43 |   the folder. This is a list of type 'time1,time1,time1,time2,time2,time2,time3'. 
44 |   If several files correspond to the same time (replicates), give them the same time value; the script then averages 
45 |   the normalized counts of all the samples of that time before performing the soft clustering.
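
The handling of replicates described for the "--time" argument can be sketched in base R as follows. The time labels and expression values are hypothetical, and the script uses its own loop rather than this code.

```r
# Hypothetical time labels, one per sample, in the same order as the files.
time <- strsplit("time1,time1,time2,time2,time3", ",")[[1]]

# Hypothetical normalized expression matrix: 3 genes x 5 samples.
exprs <- matrix(1:15, nrow = 3,
                dimnames = list(paste0("gene", 1:3), paste0("Sample", 1:5)))

# For each distinct time value, average the columns sharing that value.
# drop = FALSE keeps a one-column matrix when a time point has no replicate.
mean_by_time <- sapply(unique(time), function(tp) {
  rowMeans(exprs[, time == tp, drop = FALSE])
})

print(mean_by_time)
```

The result has one column per time value instead of one column per sample, which is the matrix passed on to the clustering.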
46 | 
47 | For a description of optional arguments, type : 
48 |       /usr/bin/Rscript Mfuzz_RNAseq.R -h
49 | 
50 | Minimal command: 
51 |       /usr/bin/Rscript Mfuzz_RNAseq.R -f count_files_folder -a annotation -t time
52 | 
53 | Complete command: 
54 |       /usr/bin/Rscript Mfuzz_RNAseq.R -f count_files_folder -a annotation -b gene_name_attribute -t time -n nb_clusters 
55 |       -m membership_cutoff -s min_std -e exclude_thres -r replacement_mode -o output_directory
56 | 


--------------------------------------------------------------------------------