├── Mfuzz_RNAseq.R
└── README.md


/Mfuzz_RNAseq.R:
--------------------------------------------------------------------------------
# Author : Amandine Velt (amandine.velt@inra.fr)
# Date : 18/11/2016

#################################################################################################################################################################
# At the beginning, Mfuzz was developed to perform clustering of gene expression from microarray data.
# To adapt this method to RNA-seq data, the authors suggest some additional preprocessing: for instance, starting from FPKMs (normalization by gene length)
# and excluding genes which do not show expression (i.e. with an FPKM equal to zero).
#
# This script takes as input a directory path containing all (and only) the RNA-seq raw count data tables (e.g. a directory with one htseq-count file per sample),
# performs the DESeq normalization method (normalization by library size) and then calculates the RPKM. After gene length normalization, this script performs
# the clustering of gene expression time-series RNA-seq data with Mfuzz.
#
# Usage :
# Complete command : /usr/bin/Rscript Mfuzz_RNAseq.R -f count_files_folder -a annotation -b gene_name_attribute -t time -n nb_clusters -m membership_cutoff -s 0 -e 0.25 -r "mean" -o output_directory
# Minimal command : /usr/bin/Rscript Mfuzz_RNAseq.R -f count_files_folder -a annotation -t time
#
# Arguments description :
# count_files_folder -> directory containing all the raw count data tables (one per sample)
# annotation -> a gtf or gff file with transcript/gene information used to calculate the gene lengths (sum of the exon lengths; overlapping exons are taken into account)
# gene_name_attribute -> the name of the attribute in the gtf referring to the gene information
# time -> the time value of each file, given in the same order as the files in the folder.
#         if several files correspond to the same time (replicates), give them the same time value; the script then averages the normalized counts of all the
#         samples of that time
# nb_clusters -> number of clusters to generate with Mfuzz (empirical choice)
# membership_cutoff -> the membership cut-off to use with Mfuzz -> see the Mfuzz paper : http://www.bioinformation.net/002/000200022007.pdf
#                      you can give one value, e.g. "0.7", or several values separated by ",", e.g. '0.5,0.7'
# output -> directory where the results are stored
#
# Examples of arguments to give :
# count_files_folder="/home/user/count_files_folder"
# annotation="HS.Genes.v2.gff"
# gene_name_attribute="gene"
# time="time1,time1,time1,time2,time2,time2,time3"
# nb_clusters = 4
# membership_cutoff = '0.5,0.7'
# min_std_threshold= 0
# exclude_thres= 0.25
# replacement_mode="mean"
# output="/home/user/cluster_output"
#################################################################################################################################################################

# library dependencies
suppressMessages(library("optparse"))
suppressMessages(library("tools"))
suppressMessages(library("Mfuzz"))
suppressMessages(library("GenomicFeatures"))
suppressMessages(library("DESeq"))
suppressMessages(library("edgeR"))

# options of the script
option_list = list(
  make_option(c("-f", "--folder"), type="character", default=NULL,
              help="[REQUIRED] Directory containing all the raw count data tables (one per sample)", metavar="character"),
  make_option(c("-a", "--annotation"), type="character", default=NULL,
              help="[REQUIRED] A gtf or gff file with transcript/gene information used to calculate the gene lengths (sum of the exon lengths; overlapping exons are taken into account)", metavar="character"),
  make_option(c("-t", "--time"), type="character", default=NULL,
              help="[REQUIRED] The time value of each file, given in the same order as the files in the folder.
              Give a list of type 'time1,time1,time1,time2,time2,time2,time3'.
              If several files correspond to the same time (replicates), give them the same time value; the script then averages the normalized counts of all the samples of that time", metavar="character"),
  make_option(c("-b", "--gene_attribute"), type="character", default="gene",
              help="The name of the attribute in the gtf referring to the gene information [default= %default]", metavar="character"),
  make_option(c("-n", "--nb_clusters"), type="integer", default=4L,
              help="Number of clusters to generate with Mfuzz (empirical choice) [default= %default]", metavar="integer"),
  # the default is a character string because the value is split on "," further down
  make_option(c("-m", "--membership_cutoff"), type="character", default="0.7",
              help="The membership cut-off used to generate the gene list of each cluster with Mfuzz.
              By default, genes whose membership value for a cluster passes the 0.7 cut-off are included in the list for this cluster.
              You can give one value, e.g. '0.7', or several values separated by ',', e.g. '0.5,0.7'.
              See the Mfuzz paper : http://www.bioinformation.net/002/000200022007.pdf [default= %default]", metavar="character"),
  make_option(c("-s", "--min_std"), type="double", default=0,
              help="Threshold for minimum standard deviation, used by Mfuzz. If the standard deviation of a gene's expression is smaller than min.std the corresponding gene will be excluded.
              Default : no filtering. [default= %default]", metavar="double"),
  make_option(c("-e", "--exclude_thres"), type="double", default=0.25,
              help="Exclude genes with more than n% of the measurements missing [default= %default] -> by default, genes with more than 25% of the measurements missing are excluded.", metavar="double"),
  make_option(c("-r", "--replacement_mode"), type="character", default="mean",
              help="Method for replacement of missing values. Fuzzy c-means, like many other clustering algorithms, does not allow for missing values.
              Thus, by default, we replace remaining missing values by the average expression value of the corresponding gene. [default= %default]
              Other available methods : median, knn, knnw", metavar="character"),
  make_option(c("-o", "--output"), type="character", default=NULL,
              help="The directory where the results are stored. By default, the current directory. [default= %default]", metavar="character")
);

# parsing of the arguments
opt_parser = OptionParser(option_list=option_list);
opt = parse_args(opt_parser);

# test the three required arguments and exit if one of them is not given
if (is.null(opt$folder)){
  print_help(opt_parser)
  stop("At least one argument must be supplied (input folder).\n", call.=FALSE)
}
if (is.null(opt$annotation)){
  print_help(opt_parser)
  stop("At least one argument must be supplied (GFF/GTF file).\n", call.=FALSE)
}
if (is.null(opt$time)){
  print_help(opt_parser)
  stop("At least one argument must be supplied (time values vector).\n", call.=FALSE)
}

# variable assignment
count_files_folder=opt$folder
annotation=opt$annotation
gene_name_attribute=opt$gene_attribute
time=as.vector(strsplit(opt$time,","))[[1]]
nb_clusters=opt$nb_clusters
membership_cutoff=as.vector(strsplit(opt$membership_cutoff,","))[[1]]
min_std_threshold=opt$min_std
exclude_thres=opt$exclude_thres
replacement_mode=opt$replacement_mode
output=opt$output

# test of the output: if empty, use the current directory
if (is.null(output)){
  output=getwd()
}
# create the output directory if it doesn't exist
dir.create(output, showWarnings = FALSE)

###########################################################################################################################
# Normalization part
###########################################################################################################################

# determine the extension of the annotation file (may be gtf or gff)
annotation_ext=file_ext(annotation)
# create the object with all count files
files=list.files(count_files_folder)
raw=readDGE(files, path=count_files_folder, group=c(1:length(files)), columns=c(1,2))
# create transcripts database from gtf or gff
txdb=makeTxDbFromGFF(annotation,format=annotation_ext)
# then collect the exons per gene id
exons.list.per.gene=exonsBy(txdb,by=gene_name_attribute)
# then for each gene, reduce all the exons to a set of non-overlapping exons, calculate their lengths (widths) and sum them
exonic.gene.sizes=as.data.frame(sum(width(reduce(exons.list.per.gene))))
colnames(exonic.gene.sizes)="gene_length_bp"
# our raw data table containing all the samples
datafile=raw$counts
# remove all the genes with a 0 count in all samples
data = datafile[apply(datafile,1,sum)!=0,]
# determine the number of studied samples
nblib= dim(data)[2]
# create a dummy condition vector (one condition per sample) for the DESeq normalization
conds = factor(1:nblib)
# normalize raw read counts by library size with the DESeq method
cds = newCountDataSet(data, conds)
cds = estimateSizeFactors(cds)
datanorm = t(t(data)/sizeFactors(cds))
colnames(datanorm)=paste(colnames(data),"normalized_by_DESeq", sep="_")
# merge the raw read counts and the read counts normalized by DESeq
alldata = merge(data, datanorm, by="row.names", all=T)
alldata_tmp=alldata[,-1]
rownames(alldata_tmp)=alldata[,1]
alldata=alldata_tmp
# merge the raw and normalized read counts with the gene lengths (in bp)
alldata_tmp = merge(alldata, exonic.gene.sizes, by="row.names", all.x=T)
alldata=alldata_tmp[,-1]
rownames(alldata)=alldata_tmp[,1]
# indices of the first and last normalized read count columns
start=length(files)+1
end=length(files)*2
# calculation of the RPKM
data_norm=merge(datanorm,exonic.gene.sizes, by="row.names", all.x=T)
data_norm_tmp=data_norm[,-1]
rownames(data_norm_tmp)=data_norm[,1]
data_norm=data_norm_tmp
# all columns except the last one (gene_length_bp) are normalized counts
rpkm=rpkm(data_norm[,seq_len(ncol(data_norm)-1)],gene.length=data_norm$gene_length_bp, normalized.lib.sizes=FALSE, log=FALSE)
colnames(rpkm)=paste(colnames(data),"normalized_by_DESeq_and_divided_by_gene_length", sep="_")
# merge the RPKM with the table containing raw and normalized read counts and gene length
alldata_tmp = merge(alldata, rpkm, by="row.names", all.x=T)
alldata=alldata_tmp[,-1]
rownames(alldata)=alldata_tmp[,1]
# write this table containing the raw read counts and the different normalizations
write.table(alldata, paste(output,"normalized_counts_all_genes.txt",sep="/"), sep="\t", quote=F, row.names=T, dec=".")

###########################################################################################################################
# Mfuzz part
###########################################################################################################################

# determine the first RPKM column in the alldata object -> RPKM are used by Mfuzz for gene clustering
first_rpkm_column=dim(alldata)[2]-length(files)+1
# here we create a matrix containing all the RPKM columns, used by Mfuzz for the clustering
exprs=as.matrix(alldata[,first_rpkm_column:dim(alldata)[2]])
# and for each time value containing replicates, we calculate the mean RPKM
# if there are no replicates, we keep the initial RPKM
count=1
for ( i in unique(time) ){
  if ( dim(as.data.frame(exprs[,which(time==i)]))[2] == 1 ){
    mean_rpkm=data.frame(exprs[,which(time==i)])
  } else {
    mean_rpkm=data.frame(rowMeans(exprs[,which(time==i)]))
  }
  colnames(mean_rpkm)=i
  if (count == 1){
    mean_rpkm_ok=mean_rpkm
  } else {
    mean_rpkm_ok=merge(mean_rpkm_ok,mean_rpkm,by="row.names")
    rownames(mean_rpkm_ok)=mean_rpkm_ok[,1]
    mean_rpkm_ok=mean_rpkm_ok[,-1]
  }
  count=count+1
}
# here we have an RPKM matrix containing one column per time value (and not one column per sample)
exprs_with_time=as.matrix(mean_rpkm_ok)

# we create the Mfuzz object (ExpressionSet)
exprSet=ExpressionSet(assayData=exprs_with_time)

#--------------------------------------------------------------------------------------------------------------------------
# As a first step, we exclude genes with more than exclude_thres (25% by default) of the measurements missing
# -> missing measurements are the NA values of the expression matrix; zero RPKM values are not counted as missing by filter.NA
exprSet.r=filter.NA(exprSet, thres=exclude_thres)
#--------------------------------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------------------------------
# Fuzzy c-means, like many other clustering algorithms, does not allow for missing values.
# Thus, we replace remaining missing values by the average expression value of the
# corresponding gene.
# Methods for replacement of missing values. Missing values should be indicated by NA in the expression matrix.
# Available modes for the replacement of missing values:
# mean - missing values will be replaced by the mean expression value of the gene,
# median - missing values will be replaced by the median expression value of the gene,
# knn - missing values will be replaced by averaging over the corresponding expression values of the k-nearest neighbours,
# knnw - same replacement method as knn, but the expression values averaged are weighted by the distance to the corresponding neighbour
exprSet.f=fill.NA(exprSet.r,mode=replacement_mode)
#--------------------------------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------------------------------
# As soft clustering is noise robust, pre-filtering can usually be avoided.
# However, if the number of genes with small expression changes is large, such pre-filtering may be necessary to reduce noise.
# This function can be used to exclude genes with low standard deviation.
# min.std : threshold for minimum standard deviation.
# If the standard deviation of a gene's expression is smaller than min.std the corresponding gene will be excluded.
tmp=filter.std(exprSet.f,min.std=min_std_threshold, visu=FALSE)
#--------------------------------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------------------------------
# Since the clustering is performed in Euclidean space, the expression values of genes are
# standardised to have a mean value of zero and a standard deviation of one.
# This step ensures
# that vectors of genes with similar changes in expression are close in Euclidean space.
# Importantly, Mfuzz assumes that the given expression data are fully preprocessed, including any data normalisation.
# The function standardise does not replace the normalisation step (e.g. RPKM normalization).
exprSet.s=standardise(tmp)
#--------------------------------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------------------------------
# clustering
m1=mestimate(exprSet.s)
cl=mfuzz(exprSet.s,c=nb_clusters,m=m1)
#--------------------------------------------------------------------------------------------------------------------------

for (membership in membership_cutoff){
  membership=as.numeric(membership)
  # create one output folder per membership
  dir=paste(output,paste("cluster_with_membership",membership, sep="_"),sep="/")
  dir.create(dir, showWarnings = FALSE)
  #--------------------------------------------------------------------------------------------------------------------------
  # membership cut-off part and plot clusters
  pdf(paste(dir,paste(paste("clusters_Mfuzz_membership_equals_",membership,sep=""),".pdf",sep=""), sep="/"))
  mfuzz.plot2(exprSet.s,cl=cl,time.labels=unique(time),min.mem=membership, colo="fancy", x11=FALSE)
  dev.off()
  #--------------------------------------------------------------------------------------------------------------------------

  #--------------------------------------------------------------------------------------------------------------------------
  # generate one gene list per cluster
  acore.list=acore(exprSet.s,cl=cl,min.acore=membership)
  print("----")
  print(paste("Membership",membership,sep=" : "))
  for (cluster in 1:nb_clusters){
    print(paste(paste("Number of genes in cluster", cluster, sep=" "),dim(acore.list[[cluster]])[1], sep=" : "))
    cluster_table=merge(alldata,acore.list[[cluster]][2], by="row.names", all.y=TRUE)
    # paste0 avoids the space that paste() would insert before ".txt"
    write.table(cluster_table,paste(dir,paste0("list_of_genes_in_cluster_",cluster,".txt"),sep="/"), sep="\t",row.names=F, dec=".")
  }
  print("----")
  #--------------------------------------------------------------------------------------------------------------------------
}


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Mfuzz_RNAseq.R
An R script to perform clustering of gene expression time-series RNA-seq data with Mfuzz.

Required R libraries :
optparse, tools, Mfuzz, GenomicFeatures, DESeq, edgeR

Mfuzz webpage : http://mfuzz.sysbiolab.eu/
Mfuzz paper : http://w3.ualg.pt/%7Emfutschik/publications/bioinformation.pdf

Mfuzz_RNAseq.R takes as input a set of RNA-seq count tables, one per sample, from HTSeq-count for example. All the RNA-seq count tables must be contained in the same folder, which is given as input to the script.
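
As a rough illustration of this input layout, the sketch below writes two small count files to a temporary folder and binds them into one gene-by-sample matrix with base R. The file names, gene IDs and counts are made up for the example, the files are assumed to list the same genes in the same order, and the script itself reads the folder with edgeR's `readDGE` rather than with this code.

```r
# Hypothetical input: two htseq-count-like files (gene ID, count; no header).
count_dir <- file.path(tempdir(), "count_files_folder")
dir.create(count_dir, showWarnings = FALSE)
writeLines(c("GeneID1\t10", "GeneID2\t0", "GeneID3\t7"),
           file.path(count_dir, "Sample1.txt"))
writeLines(c("GeneID1\t12", "GeneID2\t3", "GeneID3\t0"),
           file.path(count_dir, "Sample2.txt"))

# Read every file of the folder (list.files returns them in alphabetical order).
files <- list.files(count_dir, full.names = TRUE)
tables <- lapply(files, read.delim, header = FALSE,
                 col.names = c("GeneID", "count"), stringsAsFactors = FALSE)

# One column per sample, one row per gene (assumes identical gene order in all files).
count_matrix <- sapply(tables, function(tab) setNames(tab$count, tab$GeneID))
colnames(count_matrix) <- tools::file_path_sans_ext(basename(files))
print(count_matrix)
```

Because the columns follow the alphabetical file order, this is also the order in which the time values have to be listed.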

For example, a folder containing four count data files : Sample1.txt, Sample2.txt, Sample3.txt, Sample4.txt

Sample1.txt contains the following data, without header :

| GeneID1 | S1Count1 |
| --- | --- |
| GeneID2 | S1Count2 |
| GeneID3 | S1Count3 |
| GeneID4 | S1Count4 |

Mfuzz_RNAseq.R reads all the files and generates :

| GeneID | Sample1 | Sample2 | Sample3 | Sample4 |
| --- | --- | --- | --- | --- |
| GeneID1 | S1Count1 | S2Count1 | S3Count1 | S4Count1 |
| GeneID2 | S1Count2 | S2Count2 | S3Count2 | S4Count2 |
| GeneID3 | S1Count3 | S2Count3 | S3Count3 | S4Count3 |
| GeneID4 | S1Count4 | S2Count4 | S3Count4 | S4Count4 |

From this table, Mfuzz_RNAseq.R performs a complete RNA-seq data normalization and then uses the Mfuzz package to perform a soft clustering of gene expression time-series data.

Normalization steps : From the input count tables, the Mfuzz_RNAseq.R script performs a library size normalization with the DESeq method and then adjusts these normalized data for gene length (normalized data / gene length). These normalization steps are carried out to make all the samples comparable, which is required by the Mfuzz package.

Soft clustering steps : With these last normalized data (called RPKM data), the Mfuzz_RNAseq.R script performs a gene clustering analysis with the Mfuzz package, generating clusters and the associated gene lists.

This script has three principal inputs :
- the argument "--folder" or "-f" is the directory containing all the RNA-seq count tables (and only these files).
  Mfuzz_RNAseq.R will read and merge all these tables and will perform the normalization steps.
- the argument "--annotation" or "-a" is the path to a gene/transcript annotation file (gff or gtf format), used
  to calculate the gene lengths (sum of the exon lengths; overlapping exons are taken into account). These lengths are used
  during the data normalization by gene length.
- the argument "--time" or "-t" gives the time value of each file, in the same order as the files in
  the folder (the script lists them with list.files, i.e. in alphabetical order). This is a list of type 'time1,time1,time1,time2,time2,time2,time3'.
  If several files correspond to the same time (replicates), give them the same time value; the script then averages
  the normalized counts of all the samples of that time before performing the soft clustering.

For a description of optional arguments, type :

    /usr/bin/Rscript Mfuzz_RNAseq.R -h

Minimal command:

    /usr/bin/Rscript Mfuzz_RNAseq.R -f count_files_folder -a annotation -t time

Complete command:

    /usr/bin/Rscript Mfuzz_RNAseq.R -f count_files_folder -a annotation -b gene_name_attribute -t time -n nb_clusters \
      -m membership_cutoff -s min_std -e exclude_thres -r replacement_mode -o output_directory
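
The normalization and replicate averaging described above can be sketched in base R on a toy matrix. This is only an illustration under stated assumptions: the counts, gene lengths and time labels are invented, the size factors come from a hand-written median-of-ratios step (the idea behind DESeq's `estimateSizeFactors`) instead of a call to DESeq, the length adjustment follows the "normalized data / gene length" wording with lengths in kilobases (the script itself calls edgeR's `rpkm`), and the toy counts contain no zeros so that the logarithms stay finite.

```r
# Toy raw counts: 4 genes x 5 samples (hypothetical values, no zeros).
counts <- matrix(
  c( 10,  20,  15, 40, 50,
    100, 210, 160, 90, 80,
      5,   8,   6, 30, 33,
     60, 115,  95, 60, 62),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("GeneID", 1:4), paste0("Sample", 1:5))
)
gene_length_bp <- c(GeneID1 = 1000, GeneID2 = 2500, GeneID3 = 800, GeneID4 = 1500)
# One time value per sample, in column order (three replicates, then two).
time <- c("time1", "time1", "time1", "time2", "time2")

# Median-of-ratios size factors: each sample is compared with the
# per-gene geometric mean across all samples.
log_geo_means <- rowMeans(log(counts))
size_factors <- apply(counts, 2, function(sample_counts) {
  exp(median(log(sample_counts) - log_geo_means))
})

# Library size normalization, then division by gene length in kilobases.
norm_counts <- t(t(counts) / size_factors)
length_adjusted <- norm_counts / (gene_length_bp[rownames(norm_counts)] / 1000)

# Average the replicates sharing a time value: one column per time point.
per_time <- sapply(unique(time), function(tp) {
  rowMeans(length_adjusted[, time == tp, drop = FALSE])
})
print(round(per_time, 2))
```

Using `drop = FALSE` keeps a time point with a single sample as a one-column matrix, so the same line handles replicated and unreplicated time points.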