├── .gitignore ├── README.md ├── CRISPRiaDesign_example_notebook.md ├── CRISPRiaDesign_example_notebook.ipynb ├── sgRNA_learning.py └── Library_design_walkthrough.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | #files too large for github 7 | large_data_files/ 8 | 9 | # C extensions 10 | *.so 11 | 12 | # Distribution / packaging 13 | .Python 14 | env/ 15 | build/ 16 | develop-eggs/ 17 | dist/ 18 | downloads/ 19 | eggs/ 20 | .eggs/ 21 | lib/ 22 | lib64/ 23 | parts/ 24 | sdist/ 25 | var/ 26 | *.egg-info/ 27 | .installed.cfg 28 | *.egg 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | .hypothesis/ 50 | 51 | # Translations 52 | *.mo 53 | *.pot 54 | 55 | # Django stuff: 56 | *.log 57 | local_settings.py 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # IPython Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # dotenv 82 | .env 83 | 84 | # virtualenv 85 | venv/ 86 | ENV/ 87 | 88 | # Spyder project settings 89 | .spyderproject 90 | 91 | # Rope project settings 92 | .ropeproject 93 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CRISPRiaDesign 2 | 3 | This site 
hosts the sgRNA machine learning scripts used to generate the Weissman lab's next-generation CRISPRi and CRISPRa library designs [(Horlbeck et al., eLife 2016)](https://elifesciences.org/content/5/e19760). These are currently implemented as interactive scripts along with Jupyter notebooks that provide step-by-step instructions for creating new sgRNA libraries. Future plans include adding command line functions to make library design more user-friendly. Note that all sgRNA designs for CRISPRi/a human/mouse protein-coding gene libraries are included as supplementary tables in the eLife paper, so cloning of individual sgRNAs or construction of any custom sublibraries targeting protein-coding genes can simply refer to those tables. These scripts are primarily useful for the design of sgRNAs targeting novel or non-coding genes, or for organisms beyond human and mouse. 4 | 5 | **To apply the exact quantitative models used to generate the CRISPRi-v2 or CRISPRa-v2 libraries**, follow the steps outlined in the Library_design_walkthrough (included as a Jupyter notebook or [web page](Library_design_walkthrough.md)). 6 | 7 | To see full example code for de novo machine learning, prediction of sgRNA activity for desired loci, and construction of new genome-scale CRISPRi/a libraries, see the CRISPRiaDesign_example_notebook (included as a Jupyter notebook or [web page](CRISPRiaDesign_example_notebook.md)). 
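Before running the notebooks, it can be worth verifying that the Python dependencies listed below are importable in one pass. A minimal sketch; the module names are assumptions mapped from the package names (e.g. Biopython imports as `Bio`, Scikit-learn as `sklearn`, bx-python as `bx`):

```python
import importlib

def missing_modules(module_names):
    """Return the subset of module_names that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# assumed import names for the dependency list below
print(missing_modules(['Bio', 'scipy', 'numpy', 'pandas', 'sklearn', 'bx', 'pysam']))
```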
8 | 9 | ### Dependencies 10 | * Python v2.7 11 | * Jupyter notebook 12 | * Biopython 13 | * SciPy/NumPy/pandas 14 | * Scikit-learn 15 | * bx-python (v0.5.0, https://github.com/bxlab/bx-python) 16 | * Pysam 17 | * [ScreenProcessing](https://github.com/mhorlbeck/ScreenProcessing) 18 | 19 | External command line applications required: 20 | * ViennaRNA 21 | * Bowtie (not Bowtie2) 22 | 23 | Large genomic data files required: 24 | 25 | Links are to human genome files relied upon for the hCRISPRi-v2 and hCRISPRa-v2 machine learning--and required for the Library_design_walkthrough--but any organism/assembly may be used for design of new libraries or de novo machine learning. For convenience, **the files referenced in Library_design_walkthrough in the folder "large_data_files" are also available [here](https://ucsf.box.com/s/s4ds471in2ngjer7okavzf5cqf2ebrqj)**. 26 | 27 | * Genome sequence as FASTA ([hg19](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/)) 28 | * FANTOM5 TSS annotation as BED ([TSS_human](http://fantom.gsc.riken.jp/5/datafiles/phase1.3/extra/TSS_classifier/)) 29 | * Chromatin data as BigWig ([MNase](https://www.encodeproject.org/files/ENCFF000VNN/), [DNase](https://www.encodeproject.org/files/ENCFF000SVL/), [FAIRE-seq](https://www.encodeproject.org/files/ENCFF000TLU/)) 30 | * HGNC table of gene aliases (not strictly required for the Library_design_walkthrough but useful in some steps) 31 | * Ensembl annotation as GTF (not strictly required for the Library_design_walkthrough but useful in some steps and in other functions; release 74 used for the published library designs) 32 | -------------------------------------------------------------------------------- /CRISPRiaDesign_example_notebook.md: -------------------------------------------------------------------------------- 1 | 2 | 1. 
Learning sgRNA predictors from empirical data 3 | * Load scripts and empirical data 4 | * Generate TSS annotation using FANTOM dataset 5 | * Calculate parameters for empirical sgRNAs 6 | * Fit parameters 7 | 2. Applying machine learning model to predict sgRNA activity 8 | * Find all sgRNAs in genomic regions of interest 9 | * Predicting sgRNA activity 10 | 3. Construct sgRNA libraries 11 | * Score sgRNAs for off-target potential 12 | * Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering 13 | * Design negative controls matching the base composition of the library 14 | * Finalizing library design 15 | 16 | # 1. Learning sgRNA predictors from empirical data 17 | ## Load scripts and empirical data 18 | 19 | 20 | ```python 21 | %run sgRNA_learning.py 22 | ``` 23 | 24 | 25 | ```python 26 | genomeDict = loadGenomeAsDict(FASTA_FILE_OF_GENOME) 27 | gencodeData = loadGencodeData(GTF_FILE_FROM_GENCODE) 28 | ``` 29 | 30 | 31 | ```python 32 | #load empirical data as tables in the format generated by github.com/mhorlbeck/ScreenProcessing 33 | libraryTable, phenotypeTable, geneTable = loadExperimentData(PATHS_TO_DATA_GENERATED_BY_ScreenProcessing) 34 | ``` 35 | 36 | 37 | ```python 38 | #extract genes that scored as hits, normalize phenotypes, and extract information on sgRNAs from the sgIDs 39 | discriminantTable = calculateDiscriminantScores(geneTable) 40 | normedScores, maxDiscriminantTable = getNormalizedsgRNAsOverThresh(libraryTable, phenotypeTable, discriminantTable, 41 | DISCRIMANT_THRESHOLD_eg20, 42 | 3, transcripts=False) 43 | 44 | libraryTable_subset = libraryTable.loc[normedScores.dropna().index] 45 | sgInfoTable = parseAllSgIds(libraryTable_subset) 46 | ``` 47 | 48 | ## Generate TSS annotation using FANTOM dataset 49 | 50 | 51 | ```python 52 | #first generates a table of TSS annotations 53 | #legacy function to make an intermediate table for the "P1P2" annotation strategy, will be replaced in future versions 54 | 
#TSS_TABLE_BASED_ON_ENSEMBL is table without headers with columns: 55 | #gene, transcript, chromosome, TSS coordinate, strand, annotation_source(optional) 56 | tssTable = generateTssTable(geneTable, TSS_TABLE_BASED_ON_ENSEMBL, FANTOM_TSS_ANNOTATION_BED, 200) 57 | ``` 58 | 59 | 60 | ```python 61 | #Now create a TSS annotation by searching for P1 and P2 peaks near annotated TSSs 62 | geneToAliases = generateAliasDict(HGNC_SYMBOL_LOOKUP_TABLE,gencodeData) 63 | p1p2Table = generateTssTable_P1P2strategy(tssTable.loc[tssTable.apply(lambda row: row.name[0][:6] != 'pseudo',axis=1)], 64 | FANTOM_TSS_ANNOTATION_BED, 65 | matchedp1p2Window = 30000, #region around supplied TSS annotation to search for a FANTOM P1 or P2 peak that matches the gene name (or alias) 66 | anyp1p2Window = 500, #region around supplied TSS annotation to search for the nearest P1 or P2 peak 67 | anyPeakWindow = 200, #region around supplied TSS annotation to search for any CAGE peak 68 | minDistanceForTwoTSS = 1000, #If a P1 and P2 peak are found, maximum distance at which to combine into a single annotation (with primary/secondary TSS positions) 69 | aliasDict = geneToAliases[0]) 70 | #the function will report some collisions of IDs due to use of aliases and redundancy in genome, but will resolve these itself 71 | ``` 72 | 73 | 74 | ```python 75 | #reload the saved tables for downstream use (save them with to_csv after generating them above) 76 | tssTable = pd.read_csv(TSS_TABLE_PATH,sep='\t', index_col=range(2)) 77 | p1p2Table = pd.read_csv(P1P2_TABLE_PATH,sep='\t', header=0, index_col=range(2)) 78 | ``` 79 | 80 | ## Calculate parameters for empirical sgRNAs 81 | 82 | 83 | ```python 84 | #Load bigwig files for any chromatin data of interest 85 | bwhandleDict = {'dnase':BigWigFile(open('ENCODE_data/wgEncodeOpenChromDnaseK562BaseOverlapSignalV2.bigWig')), 86 | 'faire':BigWigFile(open('ENCODE_data/wgEncodeOpenChromFaireK562Sig.bigWig')), 87 | 'mnase':BigWigFile(open('ENCODE_data/wgEncodeSydhNsomeK562Sig.bigWig'))} 88 | ``` 89 | 90 | 91 | ```python 92 | 
paramTable_trainingGuides = generateTypicalParamTable(libraryTable_subset,sgInfoTable, tssTable, p1p2Table, genomeDict, bwhandleDict) 93 | ``` 94 | 95 | ## Fit parameters 96 | 97 | 98 | ```python 99 | #populate table of fitting parameters 100 | typeList = ['binnable_onehot', 101 | 'continuous', 'continuous', 'continuous', 'continuous', 102 | 'binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot', 103 | 'binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot', 104 | 'binary'] 105 | typeList.extend(['binary']*160) 106 | typeList.extend(['binary']*(16*38)) 107 | typeList.extend(['binnable_onehot']*3) 108 | typeList.extend(['binnable_onehot']*2) 109 | typeList.extend(['binary']*18) 110 | fitTable = pd.DataFrame(typeList, index=paramTable_trainingGuides.columns, columns=['type']) 111 | fitparams =[{'bin width':1, 'min edge data':50, 'bin function':np.median}, 112 | {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]}, 113 | {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]}, 114 | {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]}, 115 | {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]}, 116 | {'bin width':1, 'min edge data':50, 'bin function':np.median}, 117 | {'bin width':1, 'min edge data':50, 'bin function':np.median}, 118 | {'bin width':1, 'min edge data':50, 'bin function':np.median}, 119 | {'bin width':1, 'min edge data':50, 'bin function':np.median}, 120 | {'bin width':.1, 'min edge data':50, 'bin function':np.median}, 121 | {'bin width':.1, 'min edge data':50, 'bin function':np.median}, 122 | {'bin width':.1, 'min edge data':50, 'bin function':np.median}, 123 | {'bin width':.1, 'min edge data':50, 'bin function':np.median}, 124 | {'bin width':.1, 'min edge data':50, 'bin function':np.median}, 125 | {'bin width':.1, 'min edge data':50, 'bin function':np.median}, 126 | {'bin width':.1, 'min edge data':50, 'bin 
function':np.median},dict()] 127 | fitparams.extend([dict()]*160) 128 | fitparams.extend([dict()]*(16*38)) 129 | fitparams.extend([ 130 | {'bin width':.15, 'min edge data':50, 'bin function':np.median}, 131 | {'bin width':.15, 'min edge data':50, 'bin function':np.median}, 132 | {'bin width':.15, 'min edge data':50, 'bin function':np.median}]) 133 | fitparams.extend([ 134 | {'bin width':2, 'min edge data':50, 'bin function':np.median}, 135 | {'bin width':2, 'min edge data':50, 'bin function':np.median}]) 136 | fitparams.extend([dict()]*18) 137 | fitTable['params'] = fitparams 138 | ``` 139 | 140 | 141 | ```python 142 | #divide empirical data into n-folds for cross-validation 143 | geneFoldList = getGeneFolds(libraryTable_subset, 5, transcripts=False) 144 | ``` 145 | 146 | 147 | ```python 148 | #for each fold, fit parameters to training folds and measure ROC on test fold 149 | coefs = [] 150 | scoreTups = [] 151 | transformedParamTups = [] 152 | 153 | for geneFold_train, geneFold_test in geneFoldList: 154 | 155 | transformedParams_train, estimators = fitParams(paramTable_trainingGuides.loc[normedScores.dropna().index].iloc[geneFold_train], normedScores.loc[normedScores.dropna().index].iloc[geneFold_train], fitTable) 156 | 157 | transformedParams_test = transformParams(paramTable_trainingGuides.loc[normedScores.dropna().index].iloc[geneFold_test], fitTable, estimators) 158 | 159 | reg = linear_model.ElasticNetCV(l1_ratio=[.5, .75, .9, .99,1], n_jobs=16, max_iter=2000) 160 | 161 | scaler = preprocessing.StandardScaler() 162 | reg.fit(scaler.fit_transform(transformedParams_train), normedScores.loc[normedScores.dropna().index].iloc[geneFold_train]) 163 | predictedScores = pd.Series(reg.predict(scaler.transform(transformedParams_test)), index=transformedParams_test.index) 164 | testScores = normedScores.loc[normedScores.dropna().index].iloc[geneFold_test] 165 | 166 | 
transformedParamTups.append((scaler.transform(transformedParams_train),scaler.transform(transformedParams_test))) 167 | scoreTups.append((testScores, predictedScores)) 168 | 169 | print 'Prediction AUC-ROC:', metrics.roc_auc_score((testScores >= .75).values, np.array(predictedScores.values,dtype='float64')) 170 | print 'Prediction R^2:', reg.score(scaler.transform(transformedParams_test), testScores) 171 | print 'Regression parameters:', reg.l1_ratio_, reg.alpha_ 172 | coefs.append(pd.DataFrame(zip(*[abs(reg.coef_),reg.coef_]), index = transformedParams_test.columns, columns=['abs','true'])) 173 | print 'Number of features used:', len(coefs[-1]) - sum(coefs[-1]['abs'] < .00000000001) 174 | ``` 175 | 176 | 177 | ```python 178 | #can select an arbitrary fold (as shown here simply the last one tested) to save state for reproducing estimators later 179 | #the pickling of the scikit-learn estimators/regressors will allow the model to be reloaded for prediction of other guide designs, 180 | # but will not be compatible across scikit-learn versions, so it is important to preserve the training data and training/test folds 181 | import cPickle 182 | estimatorString = cPickle.dumps((fitTable, estimators, scaler, reg, (geneFold_train, geneFold_test))) 183 | with open(PICKLE_FILE,'w') as outfile: 184 | outfile.write(estimatorString) 185 | 186 | #also save the transformed parameters as these can slightly differ based on the automated binning strategy 187 | transformedParams_train.head().to_csv(TRANSFORMED_PARAM_HEADER,sep='\t') 188 | ``` 189 | 190 | # 2. 
Applying machine learning model to predict sgRNA activity 191 | 192 | 193 | ```python 194 | #starting from a new session for demonstration purposes: 195 | %run sgRNA_learning.py 196 | import cPickle 197 | 198 | #load tssTable, p1p2Table, genome sequence, chromatin data 199 | tssTable = pd.read_csv(TSS_TABLE_PATH,sep='\t', index_col=range(2)) 200 | 201 | p1p2Table = pd.read_csv(P1P2_TABLE_PATH,sep='\t', header=0, index_col=range(2)) 202 | p1p2Table['primary TSS'] = p1p2Table['primary TSS'].apply(lambda tupString: (int(tupString.strip('()').split(', ')[0]), int(tupString.strip('()').split(', ')[1]))) 203 | p1p2Table['secondary TSS'] = p1p2Table['secondary TSS'].apply(lambda tupString: (int(tupString.strip('()').split(', ')[0]),int(tupString.strip('()').split(', ')[1]))) 204 | 205 | genomeDict = loadGenomeAsDict(FASTA_FILE_OF_GENOME) 206 | 207 | bwhandleDict = {'dnase':BigWigFile(open('ENCODE_data/wgEncodeOpenChromDnaseK562BaseOverlapSignalV2.bigWig')), 208 | 'faire':BigWigFile(open('ENCODE_data/wgEncodeOpenChromFaireK562Sig.bigWig')), 209 | 'mnase':BigWigFile(open('ENCODE_data/wgEncodeSydhNsomeK562Sig.bigWig'))} 210 | 211 | #load sgRNA prediction model saved after the parameter fitting step 212 | with open(PICKLE_FILE) as infile: 213 | fitTable, estimators, scaler, reg, (geneFold_train, geneFold_test) = cPickle.load(infile) 214 | 215 | transformedParamHeader = pd.read_csv(TRANSFORMED_PARAM_HEADER,sep='\t') 216 | ``` 217 | 218 | ## Find all sgRNAs in genomic regions of interest 219 | 220 | 221 | ```python 222 | #use the same p1p2Table as above or generate a new one for novel TSSs 223 | libraryTable_new, sgInfoTable_new = findAllGuides(p1p2Table, genomeDict, (-25,500)) 224 | ``` 225 | 226 | 227 | ```python 228 | #alternately, load tables of sgRNAs to score: 229 | libraryTable_new = pd.read_csv(LIBRARY_TABLE_PATH,sep='\t',index_col=0) 230 | sgInfoTable_new = pd.read_csv(SGINFO_TABLE_PATH,sep='\t',index_col=0) 231 | ``` 232 | 233 | ## Predicting sgRNA activity 234 | 235 
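As an aside, the `primary TSS` and `secondary TSS` columns reloaded in the previous cell hold tuples serialized as strings (e.g. `'(1000, 2000)'`). The manual `strip`/`split` parsing shown there can equivalently be done with the standard library; a small sketch (`parse_tss_tuple` is not part of `sgRNA_learning.py`):

```python
import ast

def parse_tss_tuple(tup_string):
    # "(1000, 2000)" -> (1000, 2000); literal_eval safely parses the tuple literal
    return tuple(int(x) for x in ast.literal_eval(tup_string))

# e.g.: p1p2Table['primary TSS'] = p1p2Table['primary TSS'].apply(parse_tss_tuple)
```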
| 236 | ```python 237 | #calculate parameters for new sgRNAs 238 | paramTable_new = generateTypicalParamTable(libraryTable_new, sgInfoTable_new, tssTable, p1p2Table, genomeDict, bwhandleDict) 239 | ``` 240 | 241 | 242 | ```python 243 | #transform and predict scores according to sgRNA prediction model 244 | transformedParams_new = transformParams(paramTable_new, fitTable, estimators) 245 | 246 | #reconcile any differences in column headers generated by automated binning 247 | colTups = [] 248 | for (l1, l2), col in transformedParams_new.iteritems(): 249 | colTups.append((l1,str(l2))) 250 | transformedParams_new.columns = pd.MultiIndex.from_tuples(colTups) 251 | 252 | predictedScores_new = pd.Series(reg.predict(scaler.transform(transformedParams_new.loc[:, transformedParamHeader.columns].fillna(0).values)), index=transformedParams_new.index) 253 | ``` 254 | 255 | 256 | ```python 257 | predictedScores_new.to_csv(PREDICTED_SCORE_TABLE, sep='\t') 258 | ``` 259 | 260 | # 3. Construct sgRNA libraries 261 | ## Score sgRNAs for off-target potential 262 | 263 | 264 | ```python 265 | #There are many ways to score sgRNAs for off-target potential; below is one method that is simple and flexible, 266 | #but ignores gapped alignments, alternate PAMs, and uses bowtie which may not be maximally sensitive in all cases 267 | ``` 268 | 269 | 270 | ```python 271 | #output all sequences to a temporary FASTQ file for running bowtie alignment 272 | def outputTempBowtieFastq(libraryTable, outputFileName): 273 | phredString = 'I4!=======44444+++++++' #weighting for how impactful mismatches are along sgRNA sequence 274 | with open(outputFileName,'w') as outfile: 275 | for name, row in libraryTable.iterrows(): 276 | outfile.write('@' + name + '\n') 277 | outfile.write('CCN' + str(Seq.Seq(row['sequence'][1:]).reverse_complement()) + '\n') 278 | outfile.write('+\n') 279 | outfile.write(phredString + '\n') 280 | 281 | outputTempBowtieFastq(libraryTable_new, TEMP_FASTQ_FILE) 282 | ``` 283 | 
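The quality string written by `outputTempBowtieFastq` above is what makes the bowtie alignments position-aware: bowtie's `-e`/`--maqerr` option caps the *sum* of Phred quality values at mismatched read positions, so higher characters make a mismatch at that position more costly. A short sketch decoding the Phred+33 string (recall the written read is `CCN` plus the reverse complement of the protospacer, so the first three values cover the PAM and the rest run from PAM-proximal to PAM-distal):

```python
# Decode the Phred+33 quality string from outputTempBowtieFastq into
# per-position mismatch penalties (ord(char) - 33).
phred_string = 'I4!=======44444+++++++'
penalties = [ord(c) - 33 for c in phred_string]
# The 'N' of the CCN PAM gets penalty 0 (mismatching it is free),
# PAM-proximal "seed" bases get 28, and PAM-distal bases only 10.
print(penalties)
```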
285 | ```python 286 | import subprocess 287 | fqFile = TEMP_FASTQ_FILE 288 | 289 | #specifying a list of parameters to run bowtie with 290 | #each tuple contains 291 | # *the mismatch threshold below which a site is considered a potential off-target (higher is more stringent) 292 | # *the number of sites allowed (1 is minimum since each sgRNA should have one true site in genome) 293 | # *the genome index against which to align the sgRNA sequences; these can be custom built to only consider sites near TSSs 294 | # *a name for the bowtie run to create appropriately named output files 295 | alignmentList = [(39,1,'~/indices/hg19.ensemblTSSflank500b','39_nearTSS'), 296 | (31,1,'~/indices/hg19.ensemblTSSflank500b','31_nearTSS'), 297 | (21,1,'~/indices/hg19.maskChrMandPAR','21_genome'), 298 | (31,2,'~/indices/hg19.ensemblTSSflank500b','31_2_nearTSS'), 299 | (31,3,'~/indices/hg19.ensemblTSSflank500b','31_3_nearTSS')] 300 | 301 | alignmentColumns = [] 302 | for btThreshold, mflag, bowtieIndex, runname in alignmentList: 303 | 304 | alignedFile = 'bowtie_output/' + runname + '_aligned.txt' 305 | unalignedFile = 'bowtie_output/' + runname + '_unaligned.fq' 306 | maxFile = 'bowtie_output/' + runname + '_max.fq' 307 | 308 | bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile 309 | print bowtieString 310 | print subprocess.call(bowtieString, shell=True) 311 | 312 | #parse through the file of sgRNAs that exceeded "m", the maximum allowable alignments, and mark "True" any that are found 313 | with open(maxFile) as infile: 314 | sgsAligning = set() 315 | for i, line in enumerate(infile): 316 | if i%4 == 0: #id line 317 | sgsAligning.add(line.strip()[1:]) 318 | 319 | alignmentColumns.append(libraryTable_new.apply(lambda row: row.name in sgsAligning, axis=1)) 320 | 321 | #collate results into 
a table, and flip the boolean values to yield the sgRNAs that passed filter as True 322 | alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3]).ne(True) 323 | ``` 324 | 325 | ## Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering 326 | 327 | 328 | ```python 329 | #combine all generated data into one master table 330 | predictedScores_new.name = 'predicted score' 331 | v2Table = pd.concat((libraryTable_new, predictedScores_new, alignmentTable, sgInfoTable_new), axis=1, keys=['library table v2', 'predicted score', 'off-target filters', 'sgRNA info']) 332 | ``` 333 | 334 | 335 | ```python 336 | import re 337 | #for our pCRISPRi/a-v2 vector, we append flanking sequences to each sgRNA sequence for cloning and require the oligo to contain 338 | #exactly 1 BstXI and BlpI site each for cloning, and exactly 0 SbfI sites for sequencing sample preparation 339 | restrictionSites = {re.compile('CCA......TGG'):1, 340 | re.compile('GCT.AGC'):1, 341 | re.compile('CCTGCAGG'):0} 342 | 343 | def matchREsites(sequence, REdict): 344 | seq = sequence.upper() 345 | for resite, numMatchesExpected in REdict.iteritems(): 346 | if len(resite.findall(seq)) != numMatchesExpected: 347 | return False 348 | 349 | return True 350 | 351 | def checkOverlaps(leftPosition, acceptedLeftPositions, nonoverlapMin): 352 | for pos in acceptedLeftPositions: 353 | if abs(pos - leftPosition) < nonoverlapMin: 354 | return False 355 | return True 356 | ``` 357 | 358 | 359 | ```python 360 | #flanking sequences 361 | upstreamConstant = 'CCACCTTGTTG' 362 | downstreamConstant = 'GTTTAAGAGCTAAGCTG' 363 | 364 | #minimum spacing between the left positions of two sgRNAs targeting the same TSS 365 | nonoverlapMin = 3 366 | 367 | #number of sgRNAs to pick per gene/TSS 368 | sgRNAsToPick = 10 369 | 370 | #list of off-target filter (or combinations of filters) levels, matching the names in the alignment table above 371 | offTargetLevels = [['31_nearTSS', '21_genome'], 
372 | ['31_nearTSS'], 373 | ['21_genome'], 374 | ['31_2_nearTSS'], 375 | ['31_3_nearTSS']] 376 | 377 | #for each gene/TSS, go through each sgRNA in descending order of predicted score 378 | #if an sgRNA passes the restriction site, overlap, and off-target filters, accept it into the library 379 | #if the number of sgRNAs accepted is less than sgRNAsToPick, reduce off-target stringency by one and continue 380 | v2Groups = v2Table.groupby([('library table v2','gene'),('library table v2','transcripts')]) 381 | newSgIds = [] 382 | unfinishedTss = [] 383 | for (gene, transcript), group in v2Groups: 384 | geneSgIds = [] 385 | geneLeftPositions = [] 386 | empiricalSgIds = dict() 387 | 388 | stringency = 0 389 | 390 | while len(geneSgIds) < sgRNAsToPick and stringency < len(offTargetLevels): 391 | for sgId_v2, row in group.sort(('predicted score','predicted score'), ascending=False).iterrows(): 392 | oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant 393 | leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0) 394 | if len(geneSgIds) < sgRNAsToPick and row['off-target filters'].loc[offTargetLevels[stringency]].all() \ 395 | and matchREsites(oligoSeq, restrictionSites) \ 396 | and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin): 397 | geneSgIds.append((sgId_v2, 398 | gene,transcript, 399 | row[('library table v2','sequence')], oligoSeq, 400 | row[('predicted score','predicted score')], np.nan, 401 | stringency)) 402 | geneLeftPositions.append(leftPos) 403 | 404 | stringency += 1 405 | 406 | if len(geneSgIds) < sgRNAsToPick: 407 | unfinishedTss.append((gene, transcript)) #if the number of accepted sgRNAs is still less than sgRNAsToPick, discard gene 408 | else: 409 | newSgIds.extend(geneSgIds) 410 | 411 | libraryTable_complete = pd.DataFrame(newSgIds, columns = ['sgID', 'gene', 'transcript','protospacer sequence', 'oligo sequence', 412 | 'predicted score', 'empirical score', 'off-target 
stringency']).set_index('sgID') 413 | ``` 414 | 415 | 416 | ```python 417 | #number of sgRNAs accepted at each stringency level 418 | libraryTable_complete.groupby('off-target stringency').agg(len).iloc[:,0] 419 | ``` 420 | 421 | 422 | ```python 423 | #number of TSSs with fewer than required number of sgRNAs (and thus not included in the library) 424 | print len(unfinishedTss) 425 | ``` 426 | 427 | 428 | ```python 429 | #Note that empirical information from previous screens can be included as well--for example: 430 | geneToDisc = maxDiscriminantTable['best score'].groupby(level=0).agg(max).to_dict() 431 | thresh = 7 432 | empiricalBonus = .2 433 | 434 | upstreamConstant = 'CCACCTTGTTG' 435 | downstreamConstant = 'GTTTAAGAGCTAAGCTG' 436 | 437 | nonoverlapMin = 3 438 | 439 | sgRNAsToPick = 10 440 | 441 | offTargetLevels = [['31_nearTSS', '21_genome'], 442 | ['31_nearTSS'], 443 | ['21_genome'], 444 | ['31_2_nearTSS'], 445 | ['31_3_nearTSS']] 446 | offTargetLevels_v1 = [[s + '_v1' for s in l] for l in offTargetLevels] 447 | 448 | v1Groups = v1Table.groupby([('relative position','gene'),('relative position','transcript')]) 449 | v2Groups = v2Table.groupby([('library table v2','gene'),('library table v2','transcripts')]) 450 | 451 | newSgIds = [] 452 | unfinishedTss = [] 453 | for (gene, transcript), group in v2Groups: 454 | geneSgIds = [] 455 | geneLeftPositions = [] 456 | empiricalSgIds = dict() 457 | 458 | stringency = 0 459 | 460 | while len(geneSgIds) < sgRNAsToPick and stringency < len(offTargetLevels): 461 | 462 | if gene in geneToDisc and geneToDisc[gene] >= thresh and (gene, transcript) in v1Groups.groups: 463 | 464 | for sgId_v1, row in v1Groups.get_group((gene, transcript)).sort(('Empirical activity score','Empirical activity score'),ascending=False).iterrows(): 465 | oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant 466 | leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0) 467 | if 
len(geneSgIds) < sgRNAsToPick and min(abs(row.loc['relative position'].iloc[2:])) < 5000 \ 468 | and row[('Empirical activity score','Empirical activity score')] >= .75 \ 469 | and row['off-target filters'].loc[offTargetLevels_v1[stringency]].all() \ 470 | and matchREsites(oligoSeq, restrictionSites) \ 471 | and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin): 472 | if len(geneSgIds) < 2: 473 | geneSgIds.append((row[('library table v2','sgId_v2')], 474 | gene,transcript, 475 | row[('library table v2','sequence')], oligoSeq, 476 | np.nan,row[('Empirical activity score','Empirical activity score')], 477 | stringency)) 478 | geneLeftPositions.append(leftPos) 479 | 480 | empiricalSgIds[row[('library table v2','sgId_v2')]] = row[('Empirical activity score','Empirical activity score')] 481 | 482 | adjustedScores = group.apply(lambda row: row[('predicted score','CRISPRiv2 predicted score')] + empiricalBonus if row.name in empiricalSgIds else row[('predicted score','CRISPRiv2 predicted score')], axis=1) 483 | adjustedScores.name = ('adjusted score','') 484 | for sgId_v2, row in pd.concat((group,adjustedScores),axis=1).sort(('adjusted score',''), ascending=False).iterrows(): 485 | oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant 486 | leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0) 487 | if len(geneSgIds) < sgRNAsToPick and row['off-target filters'].loc[offTargetLevels[stringency]].all() \ 488 | and matchREsites(oligoSeq, restrictionSites) \ 489 | and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin): 490 | geneSgIds.append((sgId_v2, 491 | gene,transcript, 492 | row[('library table v2','sequence')], oligoSeq, 493 | row[('predicted score','CRISPRiv2 predicted score')], empiricalSgIds[sgId_v2] if sgId_v2 in empiricalSgIds else np.nan, 494 | stringency)) 495 | geneLeftPositions.append(leftPos) 496 | 497 | stringency += 1 498 | 499 | if len(geneSgIds) < sgRNAsToPick: 500 | 
unfinishedTss.append((gene, transcript)) 501 | else: 502 | newSgIds.extend(geneSgIds) 503 | 504 | 505 | libraryTable_complete = pd.DataFrame(newSgIds, columns = ['sgID', 'gene', 'transcript','protospacer sequence', 'oligo sequence', 506 | 'predicted score', 'empirical score', 'off-target stringency']).set_index('sgID') 507 | ``` 508 | 509 | ## Design negative controls matching the base composition of the library 510 | 511 | 512 | ```python 513 | #calculate the base frequency at each position of the sgRNA, then generate random sequences weighted by this frequency 514 | def getBaseFrequencies(libraryTable, baseConversion = {'G':0, 'C':1, 'T':2, 'A':3}): 515 | baseArray = np.zeros((len(libraryTable),20)) 516 | 517 | for i, (index, seq) in enumerate(libraryTable['protospacer sequence'].iteritems()): 518 | for j, char in enumerate(seq.upper()): 519 | baseArray[i,j] = baseConversion[char] 520 | 521 | baseTable = pd.DataFrame(baseArray, index = libraryTable.index) 522 | 523 | baseFrequencies = baseTable.apply(lambda col: col.groupby(col).agg(len)).fillna(0) / len(baseTable) 524 | baseFrequencies.index = ['G','C','T','A'] 525 | 526 | baseCumulativeFrequencies = baseFrequencies.copy() 527 | baseCumulativeFrequencies.loc['C'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] 528 | baseCumulativeFrequencies.loc['T'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T'] 529 | baseCumulativeFrequencies.loc['A'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T'] + baseFrequencies.loc['A'] 530 | 531 | return baseFrequencies, baseCumulativeFrequencies 532 | 533 | def generateRandomSequence(baseCumulativeFrequencies): 534 | randArray = np.random.random(baseCumulativeFrequencies.shape[1]) 535 | 536 | seq = [] 537 | for i, col in baseCumulativeFrequencies.iteritems(): 538 | for base, freq in col.iteritems(): 539 | if randArray[i] < freq: 540 | seq.append(base) 541 | break 542 | 543 | return ''.join(seq) 544 | ``` 545 | 546 
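For comparison, the per-position weighted sampling that `generateRandomSequence` implements by hand via cumulative frequencies can be written with `numpy.random.choice` using the plain (non-cumulative) frequency table. A Python 3 / modern-numpy sketch; `generate_random_sequence_np` is hypothetical and not part of `sgRNA_learning.py`:

```python
import numpy as np
import pandas as pd

def generate_random_sequence_np(base_frequencies, rng=None):
    # base_frequencies: DataFrame indexed by base ('G','C','T','A') with one
    # column per protospacer position; each column sums to 1 (this is the
    # first return value of getBaseFrequencies above)
    if rng is None:
        rng = np.random.default_rng()
    bases = list(base_frequencies.index)
    # draw one base per position, weighted by that position's frequencies
    return ''.join(rng.choice(bases, p=base_frequencies[col].values)
                   for col in base_frequencies.columns)
```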
| 547 | ```python 548 | baseCumulativeFrequencies = getBaseFrequencies(libraryTable_complete)[1] 549 | negList = [] 550 | for i in range(30000): 551 | negList.append(generateRandomSequence(baseCumulativeFrequencies)) 552 | negTable = pd.DataFrame(negList, index=['non-targeting_' + str(i) for i in range(30000)], columns = ['sequence']) 553 | 554 | outputTempBowtieFastq(negTable, TEMP_FASTQ_FILE) 555 | ``` 556 | 557 | 558 | ```python 559 | #similar to targeting sgRNA off-target scoring, but looking for sgRNAs with 0 alignments 560 | fqFile = TEMP_FASTQ_FILE 561 | 562 | alignmentList = [(31,1,'~/indices/hg19.ensemblTSSflank500b','31_nearTSS_negs'), 563 | (21,1,'~/indices/hg19.maskChrMandPAR','21_genome_negs')] 564 | 565 | alignmentColumns = [] 566 | for btThreshold, mflag, bowtieIndex, runname in alignmentList: 567 | 568 | alignedFile = 'bowtie_output/' + runname + '_aligned.txt' 569 | unalignedFile = 'bowtie_output/' + runname + '_unaligned.fq' 570 | maxFile = 'bowtie_output/' + runname + '_max.fq' 571 | 572 | bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile 573 | print bowtieString 574 | print subprocess.call(bowtieString, shell=True) 575 | 576 | #read unaligned file for negs, and then don't flip boolean of alignmentTable 577 | with open(unalignedFile) as infile: 578 | sgsAligning = set() 579 | for i, line in enumerate(infile): 580 | if i%4 == 0: #id line 581 | sgsAligning.add(line.strip()[1:]) 582 | 583 | alignmentColumns.append(negTable.apply(lambda row: row.name in sgsAligning, axis=1)) 584 | 585 | alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3]) 586 | alignmentTable.head() 587 | ``` 588 | 589 | 590 | ```python 591 | acceptedNegList = [] 592 | negCount = 0 593 | for i, (name, row) in enumerate(pd.concat((negTable,alignmentTable),axis=1, 
keys=['seq','alignment']).iterrows()): 594 | oligo = upstreamConstant + row['seq','sequence'] + downstreamConstant 595 | if row['alignment'].all() and matchREsites(oligo, restrictionSites): 596 | acceptedNegList.append(('non-targeting_%05d' % negCount, 'negative_control', 'na', row['seq','sequence'], oligo, 0)) 597 | negCount += 1 598 | 599 | acceptedNegs = pd.DataFrame(acceptedNegList, columns = ['sgId', 'gene', 'transcript', 'protospacer sequence', 'oligo sequence', 'off-target stringency']).set_index('sgId') 600 | ``` 601 | 602 | ## Finalizing library design 603 | 604 | * divide genes into sublibrary groups (if required) 605 | * assign negative control sgRNAs to sublibrary groups; ~1-2% of the number of sgRNAs in the library is a good rule-of-thumb 606 | * append PCR adapter sequences (~18bp) to each end of the oligo sequences to enable amplification of the oligo pool; each sublibrary should have an orthogonal sequence so they can be cloned separately 607 | 608 | 609 | ```python 610 | 611 | ``` 612 | -------------------------------------------------------------------------------- /CRISPRiaDesign_example_notebook.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "1. Learning sgRNA predictors from empirical data\n", 8 | " * Load scripts and empirical data\n", 9 | " * Generate TSS annotation using FANTOM dataset\n", 10 | " * Calculate parameters for empirical sgRNAs\n", 11 | " * Fit parameters\n", 12 | "2. Applying machine learning model to predict sgRNA activity\n", 13 | " * Find all sgRNAs in genomic regions of interest \n", 14 | " * Predicting sgRNA activity\n", 15 | "3. 
Construct sgRNA libraries\n", 16 | " * Score sgRNAs for off-target potential\n", 17 | " * Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering\n", 18 | " * Design negative controls matching the base composition of the library\n", 19 | " * Finalizing library design" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "# 1. Learning sgRNA predictors from empirical data\n", 27 | "## Load scripts and empirical data" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": { 34 | "collapsed": true 35 | }, 36 | "outputs": [], 37 | "source": [ 38 | "%run sgRNA_learning.py" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": { 45 | "collapsed": true 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "genomeDict = loadGenomeAsDict(FASTA_FILE_OF_GENOME)\n", 50 | "gencodeData = loadGencodeData(GTF_FILE_FROM_GENCODE)" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": { 57 | "collapsed": true 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "#load empirical data as tables in the format generated by github.com/mhorlbeck/ScreenProcessing\n", 62 | "libraryTable, phenotypeTable, geneTable = loadExperimentData(PATHS_TO_DATA_GENERATED_BY_ScreenProcessing)" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": { 69 | "collapsed": true 70 | }, 71 | "outputs": [], 72 | "source": [ 73 | "#extract genes that scored as hits, normalize phenotypes, and extract information on sgRNAs from the sgIDs\n", 74 | "discriminantTable = calculateDiscriminantScores(geneTable)\n", 75 | "normedScores, maxDiscriminantTable = getNormalizedsgRNAsOverThresh(libraryTable, phenotypeTable, discriminantTable, \n", 76 | " DISCRIMANT_THRESHOLD_eg20,\n", 77 | " 3, transcripts=False)\n", 78 | "\n", 79 | "libraryTable_subset = libraryTable.loc[normedScores.dropna().index]\n", 
80 | "sgInfoTable = parseAllSgIds(libraryTable_subset)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "## Generate TSS annotation using FANTOM dataset" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": { 94 | "collapsed": true 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "#first generates a table of TSS annotations\n", 99 | "#legacy function to make an intermediate table for the \"P1P2\" annotation strategy, will be replaced in future versions\n", 100 | "#TSS_TABLE_BASED_ON_ENSEMBL is table without headers with columns:\n", 101 | "#gene, transcript, chromosome, TSS coordinate, strand, annotation_source(optional)\n", 102 | "tssTable = generateTssTable(geneTable, TSS_TABLE_BASED_ON_ENSEMBL, FANTOM_TSS_ANNOTATION_BED, 200)" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": { 109 | "collapsed": true 110 | }, 111 | "outputs": [], 112 | "source": [ 113 | "#Now create a TSS annotation by searching for P1 and P2 peaks near annotated TSSs\n", 114 | "geneToAliases = generateAliasDict(HGNC_SYMBOL_LOOKUP_TABLE,gencodeData)\n", 115 | "p1p2Table = generateTssTable_P1P2strategy(tssTable.loc[tssTable.apply(lambda row: row.name[0][:6] != 'pseudo',axis=1)],\n", 116 | " FANTOM_TSS_ANNOTATION_BED, \n", 117 | " matchedp1p2Window = 30000, #region around supplied TSS annotation to search for a FANTOM P1 or P2 peak that matches the gene name (or alias)\n", 118 | " anyp1p2Window = 500, #region around supplied TSS annotation to search for the nearest P1 or P2 peak\n", 119 | " anyPeakWindow = 200, #region around supplied TSS annotation to search for any CAGE peak\n", 120 | " minDistanceForTwoTSS = 1000, #If a P1 and P2 peak are found, maximum distance at which to combine into a single annotation (with primary/secondary TSS positions)\n", 121 | " aliasDict = geneToAliases[0])\n", 122 | "#the function will report some collisions of IDs due to use 
of aliases and redundancy in the genome, but will resolve these itself" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": { 129 | "collapsed": true 130 | }, 131 | "outputs": [], 132 | "source": [ 133 | "#load the previously saved tables for downstream use\n", 134 | "tssTable = pd.read_csv(TSS_TABLE_PATH,sep='\\t', index_col=range(2))\n", 135 | "p1p2Table = pd.read_csv(P1P2_TABLE_PATH,sep='\\t', header=0, index_col=range(2))" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": { 141 | "collapsed": true 142 | }, 143 | "source": [ 144 | "## Calculate parameters for empirical sgRNAs" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": { 151 | "collapsed": true 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "#Load bigwig files for any chromatin data of interest\n", 156 | "bwhandleDict = {'dnase':BigWigFile(open('ENCODE_data/wgEncodeOpenChromDnaseK562BaseOverlapSignalV2.bigWig')),\n", 157 | "'faire':BigWigFile(open('ENCODE_data/wgEncodeOpenChromFaireK562Sig.bigWig')),\n", 158 | "'mnase':BigWigFile(open('ENCODE_data/wgEncodeSydhNsomeK562Sig.bigWig'))}" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": { 165 | "collapsed": true 166 | }, 167 | "outputs": [], 168 | "source": [ 169 | "paramTable_trainingGuides = generateTypicalParamTable(libraryTable_subset,sgInfoTable, tssTable, p1p2Table, genomeDict, bwhandleDict)" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "## Fit parameters" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": { 183 | "collapsed": true 184 | }, 185 | "outputs": [], 186 | "source": [ 187 | "#populate table of fitting parameters\n", 188 | "typeList = ['binnable_onehot', \n", 189 | " 'continuous', 'continuous', 'continuous', 'continuous',\n", 190 | " 
'binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot',\n", 191 | " 'binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot',\n", 192 | " 'binary']\n", 193 | "typeList.extend(['binary']*160)\n", 194 | "typeList.extend(['binary']*(16*38))\n", 195 | "typeList.extend(['binnable_onehot']*3)\n", 196 | "typeList.extend(['binnable_onehot']*2)\n", 197 | "typeList.extend(['binary']*18)\n", 198 | "fitTable = pd.DataFrame(typeList, index=paramTable_trainingGuides.columns, columns=['type'])\n", 199 | "fitparams =[{'bin width':1, 'min edge data':50, 'bin function':np.median},\n", 200 | " {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},\n", 201 | " {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},\n", 202 | " {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},\n", 203 | " {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},\n", 204 | " {'bin width':1, 'min edge data':50, 'bin function':np.median},\n", 205 | " {'bin width':1, 'min edge data':50, 'bin function':np.median},\n", 206 | " {'bin width':1, 'min edge data':50, 'bin function':np.median},\n", 207 | " {'bin width':1, 'min edge data':50, 'bin function':np.median},\n", 208 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n", 209 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n", 210 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n", 211 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n", 212 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n", 213 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n", 214 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},dict()]\n", 215 | "fitparams.extend([dict()]*160)\n", 216 | "fitparams.extend([dict()]*(16*38))\n", 217 | "fitparams.extend([\n", 218 | " {'bin width':.15, 'min edge data':50, 'bin function':np.median},\n", 219 | 
" {'bin width':.15, 'min edge data':50, 'bin function':np.median},\n", 220 | " {'bin width':.15, 'min edge data':50, 'bin function':np.median}])\n", 221 | "fitparams.extend([\n", 222 | " {'bin width':2, 'min edge data':50, 'bin function':np.median},\n", 223 | " {'bin width':2, 'min edge data':50, 'bin function':np.median}])\n", 224 | "fitparams.extend([dict()]*18)\n", 225 | "fitTable['params'] = fitparams" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": { 232 | "collapsed": true 233 | }, 234 | "outputs": [], 235 | "source": [ 236 | "#divide empirical data into n-folds for cross-validation\n", 237 | "geneFoldList = getGeneFolds(libraryTable_subset, 5, transcripts=False)" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": { 244 | "collapsed": true 245 | }, 246 | "outputs": [], 247 | "source": [ 248 | "#for each fold, fit parameters to training folds and measure ROC on test fold\n", 249 | "coefs = []\n", 250 | "scoreTups = []\n", 251 | "transformedParamTups = []\n", 252 | "\n", 253 | "for geneFold_train, geneFold_test in geneFoldList:\n", 254 | "\n", 255 | " transformedParams_train, estimators = fitParams(paramTable_trainingGuides.loc[normedScores.dropna().index].iloc[geneFold_train], normedScores.loc[normedScores.dropna().index].iloc[geneFold_train], fitTable)\n", 256 | "\n", 257 | " transformedParams_test = transformParams(paramTable_trainingGuides.loc[normedScores.dropna().index].iloc[geneFold_test], fitTable, estimators)\n", 258 | " \n", 259 | " reg = linear_model.ElasticNetCV(l1_ratio=[.5, .75, .9, .99,1], n_jobs=16, max_iter=2000)\n", 260 | " \n", 261 | " scaler = preprocessing.StandardScaler()\n", 262 | " reg.fit(scaler.fit_transform(transformedParams_train), normedScores.loc[normedScores.dropna().index].iloc[geneFold_train])\n", 263 | " predictedScores = pd.Series(reg.predict(scaler.transform(transformedParams_test)), index=transformedParams_test.index)\n", 
264 | " testScores = normedScores.loc[normedScores.dropna().index].iloc[geneFold_test]\n", 265 | " \n", 266 | " transformedParamTups.append((scaler.transform(transformedParams_train),scaler.transform(transformedParams_test)))\n", 267 | " scoreTups.append((testScores, predictedScores))\n", 268 | " \n", 269 | " print 'Prediction AUC-ROC:', metrics.roc_auc_score((testScores >= .75).values, np.array(predictedScores.values,dtype='float64'))\n", 270 | " print 'Prediction R^2:', reg.score(scaler.transform(transformedParams_test), testScores)\n", 271 | " print 'Regression parameters:', reg.l1_ratio_, reg.alpha_\n", 272 | " coefs.append(pd.DataFrame(zip(*[abs(reg.coef_),reg.coef_]), index = transformedParams_test.columns, columns=['abs','true']))\n", 273 | " print 'Number of features used:', len(coefs[-1]) - sum(coefs[-1]['abs'] < .00000000001)" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "metadata": { 280 | "collapsed": true 281 | }, 282 | "outputs": [], 283 | "source": [ 284 | "#can select an arbitrary fold (as shown here simply the last one tested) to save state for reproducing estimators later\n", 285 | "#the pickling of the scikit-learn estimators/regressors will allow the model to be reloaded for prediction of other guide designs, \n", 286 | "# but will not be compatible across scikit-learn versions, so it is important to preserve the training data and training/test folds\n", 287 | "import cPickle\n", 288 | "estimatorString = cPickle.dumps((fitTable, estimators, scaler, reg, (geneFold_train, geneFold_test)))\n", 289 | "with open(PICKLE_FILE,'w') as outfile:\n", 290 | " outfile.write(estimatorString)\n", 291 | " \n", 292 | "#also save the transformed parameters as these can slightly differ based on the automated binning strategy\n", 293 | "transformedParams_train.head().to_csv(TRANSFORMED_PARAM_HEADER,sep='\\t')" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": { 299 | "collapsed": true 300 | }, 301 
| "source": [ 302 | "# 2. Applying machine learning model to predict sgRNA activity" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "metadata": { 309 | "collapsed": true 310 | }, 311 | "outputs": [], 312 | "source": [ 313 | "#starting from a new session for demonstration purposes:\n", 314 | "%run sgRNA_learning.py\n", 315 | "import cPickle\n", 316 | "\n", 317 | "#load tssTable, p1p2Table, genome sequence, chromatin data\n", 318 | "tssTable = pd.read_csv(TSS_TABLE_PATH,sep='\\t', index_col=range(2))\n", 319 | "\n", 320 | "p1p2Table = pd.read_csv(P1P2_TABLE_PATH,sep='\\t', header=0, index_col=range(2))\n", 321 | "p1p2Table['primary TSS'] = p1p2Table['primary TSS'].apply(lambda tupString: (int(tupString.strip('()').split(', ')[0]), int(tupString.strip('()').split(', ')[1])))\n", 322 | "p1p2Table['secondary TSS'] = p1p2Table['secondary TSS'].apply(lambda tupString: (int(tupString.strip('()').split(', ')[0]),int(tupString.strip('()').split(', ')[1])))\n", 323 | "\n", 324 | "genomeDict = loadGenomeAsDict(FASTA_FILE_OF_GENOME)\n", 325 | "\n", 326 | "bwhandleDict = {'dnase':BigWigFile(open('ENCODE_data/wgEncodeOpenChromDnaseK562BaseOverlapSignalV2.bigWig')),\n", 327 | "'faire':BigWigFile(open('ENCODE_data/wgEncodeOpenChromFaireK562Sig.bigWig')),\n", 328 | "'mnase':BigWigFile(open('ENCODE_data/wgEncodeSydhNsomeK562Sig.bigWig'))}\n", 329 | "\n", 330 | "#load sgRNA prediction model saved after the parameter fitting step\n", 331 | "with open(PICKLE_FILE) as infile:\n", 332 | " fitTable, estimators, scaler, reg, (geneFold_train, geneFold_test) = cPickle.load(infile)\n", 333 | " \n", 334 | "transformedParamHeader = pd.read_csv(TRANSFORMED_PARAM_HEADER,sep='\\t')" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "## Find all sgRNAs in genomic regions of interest " 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": null, 347 | "metadata": { 348 | "collapsed": 
true 349 | }, 350 | "outputs": [], 351 | "source": [ 352 | "#use the same p1p2Table as above or generate a new one for novel TSSs\n", 353 | "libraryTable_new, sgInfoTable_new = findAllGuides(p1p2Table, genomeDict, (-25,500))" 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": null, 359 | "metadata": { 360 | "collapsed": true 361 | }, 362 | "outputs": [], 363 | "source": [ 364 | "#alternately, load tables of sgRNAs to score:\n", 365 | "libraryTable_new = pd.read_csv(LIBRARY_TABLE_PATH,sep='\\t',index_col=0)\n", 366 | "sgInfoTable_new = pd.read_csv(SGINFO_TABLE_PATH,sep='\\t',index_col=0)" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "## Predicting sgRNA activity" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": null, 379 | "metadata": { 380 | "collapsed": true 381 | }, 382 | "outputs": [], 383 | "source": [ 384 | "#calculate parameters for new sgRNAs\n", 385 | "paramTable_new = generateTypicalParamTable(libraryTable_new, sgInfoTable_new, tssTable, p1p2Table, genomeDict, bwhandleDict)" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": null, 391 | "metadata": { 392 | "collapsed": true 393 | }, 394 | "outputs": [], 395 | "source": [ 396 | "#transform and predict scores according to sgRNA prediction model\n", 397 | "transformedParams_new = transformParams(paramTable_new, fitTable, estimators)\n", 398 | "\n", 399 | "#reconcile any differences in column headers generated by automated binning\n", 400 | "colTups = []\n", 401 | "for (l1, l2), col in transformedParams_new.iteritems():\n", 402 | " colTups.append((l1,str(l2)))\n", 403 | "transformedParams_new.columns = pd.MultiIndex.from_tuples(colTups)\n", 404 | "\n", 405 | "predictedScores_new = pd.Series(reg.predict(scaler.transform(transformedParams_new.loc[:, transformedParamHeader.columns].fillna(0).values)), index=transformedParams_new.index)" 406 | ] 407 | }, 408 | { 409 | 
"cell_type": "code", 410 | "execution_count": null, 411 | "metadata": { 412 | "collapsed": true 413 | }, 414 | "outputs": [], 415 | "source": [ 416 | "predictedScores_new.to_csv(PREDICTED_SCORE_TABLE, sep='\\t')" 417 | ] 418 | }, 419 | { 420 | "cell_type": "markdown", 421 | "metadata": {}, 422 | "source": [ 423 | "# 3. Construct sgRNA libraries\n", 424 | "## Score sgRNAs for off-target potential" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": { 431 | "collapsed": true 432 | }, 433 | "outputs": [], 434 | "source": [ 435 | "#There are many ways to score sgRNAs for off-target potential; below is one method that is simple and flexible,\n", 436 | "#but it ignores gapped alignments and alternate PAMs, and uses bowtie, which may not be maximally sensitive in all cases" 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": null, 442 | "metadata": { 443 | "collapsed": true 444 | }, 445 | "outputs": [], 446 | "source": [ 447 | "#output all sequences to a temporary FASTQ file for running bowtie alignment\n", 448 | "def outputTempBowtieFastq(libraryTable, outputFileName):\n", 449 | " phredString = 'I4!=======44444+++++++' #weighting for how impactful mismatches are along sgRNA sequence \n", 450 | " with open(outputFileName,'w') as outfile:\n", 451 | " for name, row in libraryTable.iterrows():\n", 452 | " outfile.write('@' + name + '\\n')\n", 453 | " outfile.write('CCN' + str(Seq.Seq(row['sequence'][1:]).reverse_complement()) + '\\n')\n", 454 | " outfile.write('+\\n')\n", 455 | " outfile.write(phredString + '\\n')\n", 456 | " \n", 457 | "outputTempBowtieFastq(libraryTable_new, TEMP_FASTQ_FILE)" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "metadata": { 464 | "collapsed": true 465 | }, 466 | "outputs": [], 467 | "source": [ 468 | "import subprocess\n", 469 | "fqFile = TEMP_FASTQ_FILE\n", 470 | "\n", 471 | "#specifying a list of parameters to run bowtie with\n", 472 | 
"#each tuple contains\n", 473 | "# *the mismatch threshold below which a site is considered a potential off-target (higher is more stringent)\n", 474 | "# *the number of sites allowed (1 is minimum since each sgRNA should have one true site in genome)\n", 475 | "# *the genome index against which to align the sgRNA sequences; these can be custom built to only consider sites near TSSs\n", 476 | "# *a name for the bowtie run to create appropriately named output files\n", 477 | "alignmentList = [(39,1,'~/indices/hg19.ensemblTSSflank500b','39_nearTSS'),\n", 478 | " (31,1,'~/indices/hg19.ensemblTSSflank500b','31_nearTSS'),\n", 479 | " (21,1,'~/indices/hg19.maskChrMandPAR','21_genome'),\n", 480 | " (31,2,'~/indices/hg19.ensemblTSSflank500b','31_2_nearTSS'),\n", 481 | " (31,3,'~/indices/hg19.ensemblTSSflank500b','31_3_nearTSS')]\n", 482 | "\n", 483 | "alignmentColumns = []\n", 484 | "for btThreshold, mflag, bowtieIndex, runname in alignmentList:\n", 485 | "\n", 486 | " alignedFile = 'bowtie_output/' + runname + '_aligned.txt'\n", 487 | " unalignedFile = 'bowtie_output/' + runname + '_unaligned.fq'\n", 488 | " maxFile = 'bowtie_output/' + runname + '_max.fq'\n", 489 | " \n", 490 | " bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile\n", 491 | " print bowtieString\n", 492 | " print subprocess.call(bowtieString, shell=True)\n", 493 | "\n", 494 | " #parse through the file of sgRNAs that exceeded \"m\", the maximum allowable alignments, and mark \"True\" any that are found\n", 495 | " with open(maxFile) as infile:\n", 496 | " sgsAligning = set()\n", 497 | " for i, line in enumerate(infile):\n", 498 | " if i%4 == 0: #id line\n", 499 | " sgsAligning.add(line.strip()[1:])\n", 500 | "\n", 501 | " alignmentColumns.append(libraryTable_new.apply(lambda row: row.name in sgsAligning, 
axis=1))\n", 502 | " \n", 503 | "#collate results into a table, and flip the boolean values to yield the sgRNAs that passed filter as True\n", 504 | "alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3]).ne(True)" 505 | ] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "metadata": {}, 510 | "source": [ 511 | "## Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering" 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "execution_count": null, 517 | "metadata": { 518 | "collapsed": true 519 | }, 520 | "outputs": [], 521 | "source": [ 522 | "#combine all generated data into one master table\n", 523 | "predictedScores_new.name = 'predicted score'\n", 524 | "v2Table = pd.concat((libraryTable_new, predictedScores_new, alignmentTable, sgInfoTable_new), axis=1, keys=['library table v2', 'predicted score', 'off-target filters', 'sgRNA info'])" 525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": null, 530 | "metadata": { 531 | "collapsed": true 532 | }, 533 | "outputs": [], 534 | "source": [ 535 | "import re\n", 536 | "#for our pCRISPRi/a-v2 vector, we append flanking sequences to each sgRNA sequence for cloning and require the oligo to contain\n", 537 | "#exactly 1 BstXI and BlpI site each for cloning, and exactly 0 SbfI sites for sequencing sample preparation\n", 538 | "restrictionSites = {re.compile('CCA......TGG'):1,\n", 539 | " re.compile('GCT.AGC'):1,\n", 540 | " re.compile('CCTGCAGG'):0}\n", 541 | "\n", 542 | "def matchREsites(sequence, REdict):\n", 543 | " seq = sequence.upper()\n", 544 | " for resite, numMatchesExpected in restrictionSites.iteritems():\n", 545 | " if len(resite.findall(seq)) != numMatchesExpected:\n", 546 | " return False\n", 547 | " \n", 548 | " return True\n", 549 | "\n", 550 | "def checkOverlaps(leftPosition, acceptedLeftPositions, nonoverlapMin):\n", 551 | " for pos in acceptedLeftPositions:\n", 552 | " if abs(pos - leftPosition) < 
nonoverlapMin:\n", 553 | " return False\n", 554 | " return True" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": null, 560 | "metadata": { 561 | "collapsed": true 562 | }, 563 | "outputs": [], 564 | "source": [ 565 | "#flanking sequences\n", 566 | "upstreamConstant = 'CCACCTTGTTG'\n", 567 | "downstreamConstant = 'GTTTAAGAGCTAAGCTG'\n", 568 | "\n", 569 | "#minimum spacing between the left-most positions of two sgRNAs targeting the same TSS\n", 570 | "nonoverlapMin = 3\n", 571 | "\n", 572 | "#number of sgRNAs to pick per gene/TSS\n", 573 | "sgRNAsToPick = 10\n", 574 | "\n", 575 | "#list of off-target filter (or combinations of filters) levels, matching the names in the alignment table above\n", 576 | "offTargetLevels = [['31_nearTSS', '21_genome'],\n", 577 | " ['31_nearTSS'],\n", 578 | " ['21_genome'],\n", 579 | " ['31_2_nearTSS'],\n", 580 | " ['31_3_nearTSS']]\n", 581 | "\n", 582 | "#for each gene/TSS, go through each sgRNA in descending order of predicted score\n", 583 | "#if an sgRNA passes the restriction site, overlap, and off-target filters, accept it into the library\n", 584 | "#if the number of sgRNAs accepted is less than sgRNAsToPick, reduce off-target stringency by one and continue\n", 585 | "v2Groups = v2Table.groupby([('library table v2','gene'),('library table v2','transcripts')])\n", 586 | "newSgIds = []\n", 587 | "unfinishedTss = []\n", 588 | "for (gene, transcript), group in v2Groups:\n", 589 | " geneSgIds = []\n", 590 | " geneLeftPositions = []\n", 591 | " empiricalSgIds = dict()\n", 592 | " \n", 593 | " stringency = 0\n", 594 | " \n", 595 | " while len(geneSgIds) < sgRNAsToPick and stringency < len(offTargetLevels):\n", 596 | " for sgId_v2, row in group.sort(('predicted score','predicted score'), ascending=False).iterrows():\n", 597 | " oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant\n", 598 | " leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)\n", 599 | " if 
len(geneSgIds) < sgRNAsToPick and row['off-target filters'].loc[offTargetLevels[stringency]].all() \\\n", 600 | " and matchREsites(oligoSeq, restrictionSites) \\\n", 601 | " and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):\n", 602 | " geneSgIds.append((sgId_v2,\n", 603 | " gene,transcript,\n", 604 | " row[('library table v2','sequence')], oligoSeq,\n", 605 | " row[('predicted score','predicted score')], np.nan,\n", 606 | " stringency))\n", 607 | " geneLeftPositions.append(leftPos)\n", 608 | " \n", 609 | " stringency += 1\n", 610 | " \n", 611 | " if len(geneSgIds) < sgRNAsToPick:\n", 612 | " unfinishedTss.append((gene, transcript)) #if the number of accepted sgRNAs is still less than sgRNAsToPick, discard gene\n", 613 | " else:\n", 614 | " newSgIds.extend(geneSgIds)\n", 615 | " \n", 616 | "libraryTable_complete = pd.DataFrame(newSgIds, columns = ['sgID', 'gene', 'transcript','protospacer sequence', 'oligo sequence',\n", 617 | " 'predicted score', 'empirical score', 'off-target stringency']).set_index('sgID')" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": null, 623 | "metadata": { 624 | "collapsed": true 625 | }, 626 | "outputs": [], 627 | "source": [ 628 | "#number of sgRNAs accepted at each stringency level\n", 629 | "libraryTable_complete.groupby('off-target stringency').agg(len).iloc[:,0]" 630 | ] 631 | }, 632 | { 633 | "cell_type": "code", 634 | "execution_count": null, 635 | "metadata": { 636 | "collapsed": true 637 | }, 638 | "outputs": [], 639 | "source": [ 640 | "#number of TSSs with fewer than required number of sgRNAs (and thus not included in the library)\n", 641 | "print len(unfinishedTss)" 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": null, 647 | "metadata": { 648 | "collapsed": true 649 | }, 650 | "outputs": [], 651 | "source": [ 652 | "#Note that empirical information from previous screens can be included as well--for example:\n", 653 | "geneToDisc = maxDiscriminantTable['best 
score'].groupby(level=0).agg(max).to_dict()\n", 654 | "thresh = 7\n", 655 | "empiricalBonus = .2\n", 656 | "\n", 657 | "upstreamConstant = 'CCACCTTGTTG'\n", 658 | "downstreamConstant = 'GTTTAAGAGCTAAGCTG'\n", 659 | "\n", 660 | "nonoverlapMin = 3\n", 661 | "\n", 662 | "sgRNAsToPick = 10\n", 663 | "\n", 664 | "offTargetLevels = [['31_nearTSS', '21_genome'],\n", 665 | " ['31_nearTSS'],\n", 666 | " ['21_genome'],\n", 667 | " ['31_2_nearTSS'],\n", 668 | " ['31_3_nearTSS']]\n", 669 | "offTargetLevels_v1 = [[s + '_v1' for s in l] for l in offTargetLevels]\n", 670 | "\n", 671 | "v1Groups = v1Table.groupby([('relative position','gene'),('relative position','transcript')])\n", 672 | "v2Groups = v2Table.groupby([('library table v2','gene'),('library table v2','transcripts')])\n", 673 | "\n", 674 | "newSgIds = []\n", 675 | "unfinishedTss = []\n", 676 | "for (gene, transcript), group in v2Groups:\n", 677 | " geneSgIds = []\n", 678 | " geneLeftPositions = []\n", 679 | " empiricalSgIds = dict()\n", 680 | " \n", 681 | " stringency = 0\n", 682 | " \n", 683 | " while len(geneSgIds) < sgRNAsToPick and stringency < len(offTargetLevels):\n", 684 | " \n", 685 | " if gene in geneToDisc and geneToDisc[gene] >= thresh and (gene, transcript) in v1Groups.groups:\n", 686 | "\n", 687 | " for sgId_v1, row in v1Groups.get_group((gene, transcript)).sort(('Empirical activity score','Empirical activity score'),ascending=False).iterrows():\n", 688 | " oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant\n", 689 | " leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)\n", 690 | " if len(geneSgIds) < sgRNAsToPick and min(abs(row.loc['relative position'].iloc[2:])) < 5000 \\\n", 691 | " and row[('Empirical activity score','Empirical activity score')] >= .75 \\\n", 692 | " and row['off-target filters'].loc[offTargetLevels_v1[stringency]].all() \\\n", 693 | " and matchREsites(oligoSeq, restrictionSites) \\\n", 694 | " and 
checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):\n", 695 | " if len(geneSgIds) < 2:\n", 696 | " geneSgIds.append((row[('library table v2','sgId_v2')],\n", 697 | " gene,transcript,\n", 698 | " row[('library table v2','sequence')], oligoSeq,\n", 699 | " np.nan,row[('Empirical activity score','Empirical activity score')],\n", 700 | " stringency))\n", 701 | " geneLeftPositions.append(leftPos)\n", 702 | "\n", 703 | " empiricalSgIds[row[('library table v2','sgId_v2')]] = row[('Empirical activity score','Empirical activity score')]\n", 704 | "\n", 705 | " adjustedScores = group.apply(lambda row: row[('predicted score','CRISPRiv2 predicted score')] + empiricalBonus if row.name in empiricalSgIds else row[('predicted score','CRISPRiv2 predicted score')], axis=1)\n", 706 | " adjustedScores.name = ('adjusted score','')\n", 707 | " for sgId_v2, row in pd.concat((group,adjustedScores),axis=1).sort(('adjusted score',''), ascending=False).iterrows():\n", 708 | " oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant\n", 709 | " leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)\n", 710 | " if len(geneSgIds) < sgRNAsToPick and row['off-target filters'].loc[offTargetLevels[stringency]].all() \\\n", 711 | " and matchREsites(oligoSeq, restrictionSites) \\\n", 712 | " and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):\n", 713 | " geneSgIds.append((sgId_v2,\n", 714 | " gene,transcript,\n", 715 | " row[('library table v2','sequence')], oligoSeq,\n", 716 | " row[('predicted score','CRISPRiv2 predicted score')], empiricalSgIds[sgId_v2] if sgId_v2 in empiricalSgIds else np.nan,\n", 717 | " stringency))\n", 718 | " geneLeftPositions.append(leftPos)\n", 719 | " \n", 720 | " stringency += 1\n", 721 | " \n", 722 | " if len(geneSgIds) < sgRNAsToPick:\n", 723 | " unfinishedTss.append((gene, transcript))\n", 724 | " else:\n", 725 | " newSgIds.extend(geneSgIds)\n", 726 | "\n", 727 | " \n", 728 | 
"libraryTable_complete = pd.DataFrame(newSgIds, columns = ['sgID', 'gene', 'transcript','protospacer sequence', 'oligo sequence',\n", 729 | " 'predicted score', 'empirical score', 'off-target stringency']).set_index('sgID')" 730 | ] 731 | }, 732 | { 733 | "cell_type": "markdown", 734 | "metadata": {}, 735 | "source": [ 736 | "## Design negative controls matching the base composition of the library" 737 | ] 738 | }, 739 | { 740 | "cell_type": "code", 741 | "execution_count": null, 742 | "metadata": { 743 | "collapsed": true 744 | }, 745 | "outputs": [], 746 | "source": [ 747 | "#calculate the base frequency at each position of the sgRNA, then generate random sequences weighted by this frequency\n", 748 | "def getBaseFrequencies(libraryTable, baseConversion = {'G':0, 'C':1, 'T':2, 'A':3}):\n", 749 | " baseArray = np.zeros((len(libraryTable),20))\n", 750 | "\n", 751 | " for i, (index, seq) in enumerate(libraryTable['protospacer sequence'].iteritems()):\n", 752 | " for j, char in enumerate(seq.upper()):\n", 753 | " baseArray[i,j] = baseConversion[char]\n", 754 | "\n", 755 | " baseTable = pd.DataFrame(baseArray, index = libraryTable.index)\n", 756 | " \n", 757 | " baseFrequencies = baseTable.apply(lambda col: col.groupby(col).agg(len)).fillna(0) / len(baseTable)\n", 758 | " baseFrequencies.index = ['G','C','T','A']\n", 759 | " \n", 760 | " baseCumulativeFrequencies = baseFrequencies.copy()\n", 761 | " baseCumulativeFrequencies.loc['C'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C']\n", 762 | " baseCumulativeFrequencies.loc['T'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T']\n", 763 | " baseCumulativeFrequencies.loc['A'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T'] + baseFrequencies.loc['A']\n", 764 | "\n", 765 | " return baseFrequencies, baseCumulativeFrequencies\n", 766 | "\n", 767 | "def generateRandomSequence(baseCumulativeFrequencies):\n", 768 | " randArray = 
np.random.random(baseCumulativeFrequencies.shape[1])\n", 769 | " \n", 770 | " seq = []\n", 771 | " for i, col in baseCumulativeFrequencies.iteritems():\n", 772 | " for base, freq in col.iteritems():\n", 773 | " if randArray[i] < freq:\n", 774 | " seq.append(base)\n", 775 | " break\n", 776 | " \n", 777 | " return ''.join(seq)" 778 | ] 779 | }, 780 | { 781 | "cell_type": "code", 782 | "execution_count": null, 783 | "metadata": { 784 | "collapsed": true 785 | }, 786 | "outputs": [], 787 | "source": [ 788 | "baseCumulativeFrequencies = getBaseFrequencies(libraryTable_complete)[1]\n", 789 | "negList = []\n", 790 | "for i in range(30000):\n", 791 | " negList.append(generateRandomSequence(baseCumulativeFrequencies))\n", 792 | "negTable = pd.DataFrame(negList, index=['non-targeting_' + str(i) for i in range(30000)], columns = ['sequence'])\n", 793 | "\n", 794 | "outputTempBowtieFastq(negTable, TEMP_FASTQ_FILE)" 795 | ] 796 | }, 797 | { 798 | "cell_type": "code", 799 | "execution_count": null, 800 | "metadata": { 801 | "collapsed": true 802 | }, 803 | "outputs": [], 804 | "source": [ 805 | "#similar to targeting sgRNA off-target scoring, but looking for sgRNAs with 0 alignments\n", 806 | "fqFile = TEMP_FASTQ_FILE\n", 807 | "\n", 808 | "alignmentList = [(31,1,'~/indices/hg19.ensemblTSSflank500b','31_nearTSS_negs'),\n", 809 | " (21,1,'~/indices/hg19.maskChrMandPAR','21_genome_negs')]\n", 810 | "\n", 811 | "alignmentColumns = []\n", 812 | "for btThreshold, mflag, bowtieIndex, runname in alignmentList:\n", 813 | "\n", 814 | " alignedFile = 'bowtie_output/' + runname + '_aligned.txt'\n", 815 | " unalignedFile = 'bowtie_output/' + runname + '_unaligned.fq'\n", 816 | " maxFile = 'bowtie_output/' + runname + '_max.fq'\n", 817 | " \n", 818 | " bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ 
alignedFile\n", 819 | " print bowtieString\n", 820 | " print subprocess.call(bowtieString, shell=True)\n", 821 | "\n", 822 | " #read unaligned file for negs, and then don't flip boolean of alignmentTable\n", 823 | " with open(unalignedFile) as infile:\n", 824 | " sgsAligning = set()\n", 825 | " for i, line in enumerate(infile):\n", 826 | " if i%4 == 0: #id line\n", 827 | " sgsAligning.add(line.strip()[1:])\n", 828 | "\n", 829 | " alignmentColumns.append(negTable.apply(lambda row: row.name in sgsAligning, axis=1))\n", 830 | " \n", 831 | "alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3])\n", 832 | "alignmentTable.head()" 833 | ] 834 | }, 835 | { 836 | "cell_type": "code", 837 | "execution_count": null, 838 | "metadata": { 839 | "collapsed": true 840 | }, 841 | "outputs": [], 842 | "source": [ 843 | "acceptedNegList = []\n", 844 | "negCount = 0\n", 845 | "for i, (name, row) in enumerate(pd.concat((negTable,alignmentTable),axis=1, keys=['seq','alignment']).iterrows()):\n", 846 | " oligo = upstreamConstant + row['seq','sequence'] + downstreamConstant\n", 847 | " if row['alignment'].all() and matchREsites(oligo, restrictionSites):\n", 848 | " acceptedNegList.append(('non-targeting_%05d' % negCount, 'negative_control', 'na', row['seq','sequence'], oligo, 0))\n", 849 | " negCount += 1\n", 850 | " \n", 851 | "acceptedNegs = pd.DataFrame(acceptedNegList, columns = ['sgId', 'gene', 'transcript', 'protospacer sequence', 'oligo sequence', 'off-target stringency']).set_index('sgId')" 852 | ] 853 | }, 854 | { 855 | "cell_type": "markdown", 856 | "metadata": {}, 857 | "source": [ 858 | "## Finalizing library design" 859 | ] 860 | }, 861 | { 862 | "cell_type": "markdown", 863 | "metadata": {}, 864 | "source": [ 865 | "* divide genes into sublibrary groups (if required)\n", 866 | "* assign negative control sgRNAs to sublibrary groups; ~1-2% of the number of sgRNAs in the library is a good rule-of-thumb\n", 867 | "* append PCR adapter sequences (~18bp) 
to each end of the oligo sequences to enable amplification of the oligo pool; each sublibrary should have an orthogonal sequence so they can be cloned separately" 868 | ] 869 | }, 870 | { 871 | "cell_type": "code", 872 | "execution_count": null, 873 | "metadata": { 874 | "collapsed": true 875 | }, 876 | "outputs": [], 877 | "source": [] 878 | } 879 | ], 880 | "metadata": { 881 | "kernelspec": { 882 | "display_name": "Python 2", 883 | "language": "python", 884 | "name": "python2" 885 | }, 886 | "language_info": { 887 | "codemirror_mode": { 888 | "name": "ipython", 889 | "version": 2 890 | }, 891 | "file_extension": ".py", 892 | "mimetype": "text/x-python", 893 | "name": "python", 894 | "nbconvert_exporter": "python", 895 | "pygments_lexer": "ipython2", 896 | "version": "2.7.3" 897 | } 898 | }, 899 | "nbformat": 4, 900 | "nbformat_minor": 0 901 | } 902 | -------------------------------------------------------------------------------- /sgRNA_learning.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import subprocess 4 | import tempfile 5 | import multiprocessing 6 | import numpy as np 7 | import scipy as sp 8 | import pandas as pd 9 | from ConfigParser import SafeConfigParser 10 | from Bio import Seq, SeqIO 11 | import pysam 12 | from bx.bbi.bigwig_file import BigWigFile 13 | from sklearn import linear_model, svm, ensemble, preprocessing, grid_search, metrics 14 | 15 | from expt_config_parser import parseExptConfig, parseLibraryConfig 16 | 17 | ############################################################################### 18 | # Import and Merge Training/Test Data # 19 | ############################################################################### 20 | def loadExperimentData(experimentFile, supportedLibraryPath, library, basePath = '.'): 21 | libDict, librariesToTables = parseLibraryConfig(os.path.join(supportedLibraryPath, 'library_config.txt')) 22 | 23 | geneTableDict = dict() 24 | 
phenotypeTableDict = dict() 25 | libraryTableDict = dict() 26 | 27 | parser = SafeConfigParser() 28 | parser.read(experimentFile) 29 | for exptConfigFile in parser.sections(): 30 | configDict = parseExptConfig(exptConfigFile,libDict)[0] 31 | 32 | libraryTable = pd.read_csv(os.path.join(basePath,configDict['output_folder'],configDict['experiment_name']) + '_librarytable.txt', 33 | sep='\t', index_col=range(1), header=0) 34 | libraryTableDict[configDict['experiment_name']] = libraryTable 35 | 36 | geneTable = pd.read_csv(os.path.join(basePath,configDict['output_folder'],configDict['experiment_name']) + '_genetable.txt', 37 | sep='\t',index_col=range(2),header=range(3)) 38 | phenotypeTable = pd.read_csv(os.path.join(basePath,configDict['output_folder'],configDict['experiment_name']) + '_phenotypetable.txt',\ 39 | sep='\t',index_col=range(1),header=range(2)) 40 | 41 | condTups = [(condStr.split(':')[0],condStr.split(':')[1]) for condStr in parser.get(exptConfigFile, 'condition_tuples').strip().split('\n')] 42 | # print condTups 43 | 44 | geneTableDict[configDict['experiment_name']] = geneTable.loc[:,[level_name for level_name in geneTable.columns if (level_name[0],level_name[1]) in condTups]] 45 | phenotypeTableDict[configDict['experiment_name']] = phenotypeTable.loc[:,[level_name for level_name in phenotypeTable.columns if (level_name[0],level_name[1]) in condTups]] 46 | 47 | mergedLibraryTable = pd.concat(libraryTableDict.values()) 48 | # print mergedLibraryTable.head() 49 | mergedLibraryTable_dedup = mergedLibraryTable.drop_duplicates(['gene','sequence']) 50 | # print mergedLibraryTable_dedup.head() 51 | mergedGeneTable = pd.concat(geneTableDict.values(), keys=geneTableDict.keys(), axis = 1) 52 | # print mergedGeneTable.head() 53 | mergedPhenotypeTable = pd.concat(phenotypeTableDict.values(), keys=phenotypeTableDict.keys(), axis = 1) 54 | # print mergedPhenotypeTable.head() 55 | mergedPhenotypeTable_dedup = mergedPhenotypeTable.loc[mergedLibraryTable_dedup.index] 56 
| 57 | return mergedLibraryTable_dedup, mergedPhenotypeTable_dedup, mergedGeneTable 58 | 59 | def calculateDiscriminantScores(geneTable, effectSize = 'average phenotype of strongest 3', pValue = 'Mann-Whitney p-value'): 60 | isPseudo = getPseudoIndices(geneTable) 61 | geneTable_reordered = geneTable.reorder_levels((3,0,1,2), axis=1) 62 | zscores = geneTable_reordered[effectSize] / geneTable_reordered.loc[isPseudo,effectSize].std() 63 | pvals = -1 * np.log10(geneTable_reordered[pValue]) 64 | 65 | seriesDict = dict() 66 | for group, table in pd.concat((zscores, pvals), keys=(effectSize,pValue),axis=1).reorder_levels((1,2,3,0), axis=1).groupby(level=range(3),axis=1): 67 | # print table.head() 68 | seriesDict[group] = table[group].apply(lambda row: row[effectSize] * row[pValue], axis=1) 69 | 70 | return pd.DataFrame(seriesDict) 71 | 72 | def getNormalizedsgRNAsOverThresh(libraryTable, phenotypeTable, discriminantTable, threshold, numToNormalize, transcripts=True): 73 | maxDiscriminants = pd.concat([discriminantTable.abs().idxmax(axis=1), discriminantTable.abs().max(axis=1)], keys = ('best col','best score'), axis=1) 74 | 75 | if transcripts: 76 | grouper = (libraryTable['gene'],libraryTable['transcripts']) 77 | else: 78 | grouper = libraryTable['gene'] 79 | 80 | normedPhenotypes = [] 81 | for name, group in phenotypeTable.groupby(grouper): 82 | if (transcripts and name[0] == 'negative_control') or (not transcripts and name == 'negative_control'): 83 | continue 84 | maxDisc = maxDiscriminants.loc[name] 85 | 86 | if not transcripts: 87 | maxDisc = maxDisc.sort('best score').iloc[-1] 88 | 89 | if maxDisc['best score'] >= threshold: 90 | bestGroup = group[maxDisc['best col']] 91 | normedPhenotypes.append(bestGroup / np.mean(sorted(bestGroup.dropna(), key=abs, reverse=True)[:numToNormalize])) 92 | 93 | return pd.concat(normedPhenotypes), maxDiscriminants 94 | 95 | def getGeneFolds(libraryTable, kfold, transcripts=True): 96 | if transcripts: 97 | geneGroups = 
pd.Series(range(len(libraryTable)), index=libraryTable.index).groupby((libraryTable['gene'],libraryTable['transcripts'])) 98 | else: 99 | geneGroups = pd.Series(range(len(libraryTable)), index=libraryTable.index).groupby(libraryTable['gene']) 100 | 101 | idxList = np.arange(geneGroups.ngroups) 102 | np.random.shuffle(idxList) 103 | 104 | foldsize = int(np.floor(geneGroups.ngroups * 1.0 / kfold)) 105 | folds = [] 106 | for i in range(kfold): 107 | testGroups = [] 108 | trainGroups = [] 109 | testSet = set(idxList[i * foldsize: (i+1) * foldsize]) 110 | for j, (name, group) in enumerate(geneGroups): #distinct index variable to avoid shadowing the fold counter i 111 | if j in testSet: 112 | testGroups.extend(group.values) 113 | else: 114 | trainGroups.extend(group.values) 115 | folds.append((trainGroups,testGroups)) 116 | 117 | return folds 118 | 119 | 120 | ############################################################################### 121 | # Calculate sgRNA Parameters # 122 | ############################################################################### 123 | #tss annotations relying on library input TSSs, may want to convert to gencode in future 124 | def generateTssTable(geneTable, libraryTssFile, cagePeakFile, cageWindow, aliasDict = {'NFIK':'MKI67IP'}): 125 | codingTssList = [] 126 | with open(libraryTssFile) as infile: 127 | for line in infile: 128 | linesplit = line.strip().split('\t') 129 | try: 130 | chrom = int(linesplit[2][3:]) 131 | except ValueError: 132 | chrom = linesplit[2][3:] 133 | codingTssList.append((chrom, int(linesplit[3]), linesplit[0], linesplit[1], linesplit[2], linesplit[3], linesplit[4])) 134 | 135 | codingTupDict = {(tup[2],tup[3]):tup for tup in codingTssList} 136 | 137 | codingGeneToTransList = dict() 138 | 139 | for geneTrans in codingTupDict: 140 | if geneTrans[0] not in codingGeneToTransList: 141 | codingGeneToTransList[geneTrans[0]] = [] 142 | 143 | codingGeneToTransList[geneTrans[0]].append(geneTrans[1]) 144 | 145 | positionList = [] 146 | for (gene,transcriptList), row in 
geneTable.iterrows(): 147 | 148 | if gene not in codingGeneToTransList: #only pseudogenes 149 | positionList.append((np.nan,np.nan,np.nan)) 150 | continue 151 | 152 | if transcriptList == 'all': 153 | transList = codingGeneToTransList[gene] 154 | else: 155 | transList = transcriptList.split(',') 156 | positions = [codingTupDict[(gene, trans)][1] for trans in transList] 157 | positionList.append((np.mean(positions), codingTupDict[(gene,transList[-1])][6], codingTupDict[(gene,transList[-1])][4])) #strand and chromosome are shared across a gene's transcripts 158 | 159 | tssPositionTable = pd.DataFrame(positionList, index=geneTable.index, columns=['position', 'strand','chromosome']) 160 | 161 | cagePeaks = pysam.Tabixfile(cagePeakFile) 162 | halfwindow = cageWindow 163 | strictColor = '60,179,113' 164 | relaxedColor = '30,144,255' 165 | 166 | cagePeakRanges = [] 167 | for i, (gt, tssRow) in enumerate(tssPositionTable.dropna().iterrows()): 168 | peaks = cagePeaks.fetch(tssRow['chromosome'],tssRow['position'] - halfwindow,tssRow['position'] + halfwindow, parser=pysam.asBed()) 169 | 170 | ranges = [] 171 | relaxedRanges = [] 172 | for peak in peaks: 173 | # print peak 174 | if peak.strand == tssRow['strand'] and peak.itemRGB == strictColor: 175 | ranges.append((peak.start, peak.end)) 176 | elif peak.strand == tssRow['strand'] and peak.itemRGB == relaxedColor: 177 | relaxedRanges.append((peak.start, peak.end)) 178 | 179 | if len(ranges) > 0: 180 | cagePeakRanges.append(ranges) 181 | else: 182 | cagePeakRanges.append(relaxedRanges) 183 | 184 | cageSeries = pd.Series(cagePeakRanges, index = tssPositionTable.dropna().index) 185 | 186 | tssPositionTable_cage = pd.concat([tssPositionTable, cageSeries], axis=1) 187 | tssPositionTable_cage.columns = ['position', 'strand','chromosome','cage peak ranges'] 188 | return tssPositionTable_cage 189 | 190 | # for (gene, transList), row in geneTable.iterrows(): 191 | # if gene not in gencodeData and gene in aliasDict: 192 | # geneData = gencodeData[aliasDict[gene]] 193 | # else: 194 | # 
geneData = gencodeData[gene] 195 | 196 | def generateTssTable_P1P2strategy(tssTable, cagePeakFile, matchedp1p2Window, anyp1p2Window, anyPeakWindow, minDistanceForTwoTSS, aliasDict): 197 | cagePeaks = pysam.Tabixfile(cagePeakFile) 198 | strictColor = '60,179,113' 199 | relaxedColor = '30,144,255' 200 | 201 | resultRows = [] 202 | for gene, tssRowGroup in tssTable.groupby(level=0): 203 | 204 | if len(set(tssRowGroup['chromosome'].values)) == 1: 205 | chrom = tssRowGroup['chromosome'].values[0] 206 | else: 207 | raise ValueError('multiple annotated chromosomes for ' + gene) 208 | 209 | if len(set(tssRowGroup['strand'].values)) == 1: 210 | strand = tssRowGroup['strand'].values[0] 211 | else: 212 | raise ValueError('multiple annotated strands for ' + gene) 213 | 214 | #try to match P1/P2 names within the window 215 | # peaks = cagePeaks.fetch(chrom,max(0,tssRowGroup['position'].min() - matchedp1p2Window),tssRowGroup['position'].max() + matchedp1p2Window, parser=pysam.asBed()) 216 | peaks = [] 217 | for transcript, row in tssRowGroup.iterrows(): 218 | peaks.extend([p for p in cagePeaks.fetch(chrom,max(0,row['position'] - matchedp1p2Window),row['position'] + matchedp1p2Window, parser=pysam.asBed())]) 219 | p1Matches = set() 220 | p2Matches = set() 221 | for peak in peaks: 222 | if peak.strand == strand and matchPeakName(peak.name, aliasDict[gene] if gene in aliasDict else [gene], 'p1'): 223 | p1Matches.add((peak.start,peak.end)) 224 | elif peak.strand == strand and matchPeakName(peak.name, aliasDict[gene] if gene in aliasDict else [gene], 'p2') and peak.itemRGB == strictColor: 225 | p2Matches.add((peak.start,peak.end)) 226 | p1Matches = list(p1Matches) 227 | p2Matches = list(p2Matches) 228 | 229 | if len(p1Matches) >= 1: 230 | if len(p1Matches) > 1: 231 | print 'multiple matched p1:', gene, p1Matches, p2Matches #rare event, typically a doubly-named TSS, basically at the same spot 232 | 233 | closestMatch = p1Matches[0] 234 | for match in p1Matches: 235 | if 
min(abs(match[0] - tssRowGroup['position'])) < min(abs(closestMatch[0] - tssRowGroup['position'])): 236 | closestMatch = match 237 | p1Matches = [closestMatch] 238 | 239 | if len(p2Matches) > 1: 240 | print 'multiple matched p2:', gene, p1Matches, p2Matches 241 | 242 | closestMatch = p2Matches[0] 243 | for match in p2Matches: 244 | if min(abs(match[0] - tssRowGroup['position'])) < min(abs(closestMatch[0] - tssRowGroup['position'])): 245 | closestMatch = match 246 | p2Matches = [closestMatch] 247 | 248 | if len(p2Matches) == 0 or abs(p1Matches[0][0] - p2Matches[0][0]) <= minDistanceForTwoTSS: 249 | resultRows.append((gene,'P1P2', chrom, strand, 'CAGE, matched peaks', p1Matches[0], p2Matches[0] if len(p2Matches) > 0 else p1Matches[0])) 250 | else: 251 | resultRows.append((gene,'P1', chrom, strand, 'CAGE, matched peaks', p1Matches[0], p1Matches[0])) 252 | resultRows.append((gene,'P2', chrom, strand, 'CAGE, matched peaks', p2Matches[0], p2Matches[0])) 253 | 254 | 255 | #try to match any P1/P2 names 256 | else: 257 | peaks = [] 258 | for transcript, row in tssRowGroup.iterrows(): 259 | peaks.extend([p for p in cagePeaks.fetch(chrom,max(0,row['position'] - anyp1p2Window),row['position'] + anyp1p2Window, parser=pysam.asBed())]) 260 | p1Matches = set() 261 | p2Matches = set() 262 | for peak in peaks: 263 | if peak.strand == strand and peak.name.find('p1@') != -1: 264 | p1Matches.add((peak.start,peak.end)) 265 | elif peak.strand == strand and peak.name.find('p2@') != -1 and peak.itemRGB == strictColor: 266 | p2Matches.add((peak.start,peak.end)) 267 | p1Matches = list(p1Matches) 268 | p2Matches = list(p2Matches) 269 | 270 | if len(p1Matches) >=1: 271 | if len(p1Matches) > 1: 272 | print 'multiple nearby p1:', gene, p1Matches, p2Matches 273 | 274 | closestMatch = p1Matches[0] 275 | for match in p1Matches: 276 | if min(abs(match[0] - tssRowGroup['position'])) < min(abs(closestMatch[0] - tssRowGroup['position'])): 277 | closestMatch = match 278 | p1Matches = [closestMatch] 279 
| 280 | if len(p2Matches) > 1: 281 | print 'multiple nearby p2:', gene, p1Matches, p2Matches 282 | 283 | closestMatch = p2Matches[0] 284 | for match in p2Matches: 285 | if min(abs(match[0] - tssRowGroup['position'])) < min(abs(closestMatch[0] - tssRowGroup['position'])): 286 | closestMatch = match 287 | p2Matches = [closestMatch] 288 | 289 | if len(p2Matches) == 0 or abs(p1Matches[0][0] - p2Matches[0][0]) <= minDistanceForTwoTSS: 290 | resultRows.append((gene,'P1P2', chrom, strand, 'CAGE, primary peaks', p1Matches[0], p2Matches[0] if len(p2Matches) > 0 else p1Matches[0])) 291 | else: 292 | resultRows.append((gene,'P1', chrom, strand, 'CAGE, primary peaks', p1Matches[0], p1Matches[0])) 293 | resultRows.append((gene,'P2', chrom, strand, 'CAGE, primary peaks', p2Matches[0], p2Matches[0])) 294 | 295 | 296 | #try to match robust or permissive peaks 297 | else: 298 | for transcript, row in tssRowGroup.iterrows(): 299 | peaks = cagePeaks.fetch(chrom,max(0,row['position'] - anyPeakWindow),row['position'] + anyPeakWindow, parser=pysam.asBed()) 300 | robustPeaks = [] 301 | permissivePeaks = [] 302 | for peak in peaks: 303 | if peak.strand == strand and peak.itemRGB == strictColor: 304 | robustPeaks.append((peak.start,peak.end)) 305 | if peak.strand == strand and peak.itemRGB == relaxedColor: 306 | permissivePeaks.append((peak.start,peak.end)) 307 | 308 | if len(robustPeaks) >= 1: 309 | if strand == '+': 310 | resultRows.append((gene,transcript[1], chrom, strand, 'CAGE, robust peak', robustPeaks[0], robustPeaks[-1])) 311 | else: 312 | resultRows.append((gene,transcript[1], chrom, strand, 'CAGE, robust peak', robustPeaks[-1], robustPeaks[0])) 313 | elif len(permissivePeaks) >= 1: 314 | if strand == '+': 315 | resultRows.append((gene,transcript[1], chrom, strand, 'CAGE permissive peak', permissivePeaks[0], permissivePeaks[-1])) 316 | else: 317 | resultRows.append((gene,transcript[1], chrom, strand, 'CAGE permissive peak', permissivePeaks[-1], permissivePeaks[0])) 318 | else: 
319 | resultRows.append((gene, transcript[1], chrom, strand, 'Annotation', (row['position'],row['position']), (row['position'],row['position']))) 320 | 321 | return pd.DataFrame(resultRows, columns=['gene','transcript','chromosome','strand','TSS source','primary TSS','secondary TSS']).set_index(keys=['gene','transcript']) 322 | 323 | def generateSgrnaDistanceTable_p1p2Strategy(sgInfoTable, libraryTable, p1p2Table, transcripts=False): 324 | sgDistanceSeries = [] 325 | 326 | if transcripts == False: # when sgRNAs weren't designed based on the p1p2 strategy 327 | for name, group in sgInfoTable['pam coordinate'].groupby(libraryTable['gene']): 328 | if name in p1p2Table.index: 329 | tssRow = p1p2Table.loc[name] 330 | 331 | if len(tssRow) == 1: 332 | tssRow = tssRow.iloc[0] 333 | for sgId, pamCoord in group.iteritems(): 334 | if tssRow['strand'] == '+': 335 | sgDistanceSeries.append((sgId, name, tssRow.name, 336 | pamCoord - tssRow['primary TSS'][0], 337 | pamCoord - tssRow['primary TSS'][1], 338 | pamCoord - tssRow['secondary TSS'][0], 339 | pamCoord - tssRow['secondary TSS'][1])) 340 | else: 341 | sgDistanceSeries.append((sgId, name, tssRow.name, 342 | (pamCoord - tssRow['primary TSS'][1]) * -1, 343 | (pamCoord - tssRow['primary TSS'][0]) * -1, 344 | (pamCoord - tssRow['secondary TSS'][1]) * -1, 345 | (pamCoord - tssRow['secondary TSS'][0]) * -1)) 346 | 347 | else: 348 | for sgId, pamCoord in group.iteritems(): 349 | closestTssRow = tssRow.loc[tssRow.apply(lambda row: abs(pamCoord - row['primary TSS'][0]), axis=1).idxmin()] 350 | 351 | if closestTssRow['strand'] == '+': 352 | sgDistanceSeries.append((sgId, name, closestTssRow.name, 353 | pamCoord - closestTssRow['primary TSS'][0], 354 | pamCoord - closestTssRow['primary TSS'][1], 355 | pamCoord - closestTssRow['secondary TSS'][0], 356 | pamCoord - closestTssRow['secondary TSS'][1])) 357 | else: 358 | sgDistanceSeries.append((sgId, name, closestTssRow.name, 359 | (pamCoord - closestTssRow['primary TSS'][1]) * -1, 360 | 
(pamCoord - closestTssRow['primary TSS'][0]) * -1, 361 | (pamCoord - closestTssRow['secondary TSS'][1]) * -1, 362 | (pamCoord - closestTssRow['secondary TSS'][0]) * -1)) 363 | else: 364 | for name, group in sgInfoTable['pam coordinate'].groupby([libraryTable['gene'],libraryTable['transcripts']]): 365 | if name in p1p2Table.index: 366 | tssRow = p1p2Table.loc[[name]] 367 | 368 | if len(tssRow) == 1: 369 | tssRow = tssRow.iloc[0] 370 | for sgId, pamCoord in group.iteritems(): 371 | if tssRow['strand'] == '+': 372 | sgDistanceSeries.append((sgId, tssRow.name[0], tssRow.name[1], 373 | pamCoord - tssRow['primary TSS'][0], 374 | pamCoord - tssRow['primary TSS'][1], 375 | pamCoord - tssRow['secondary TSS'][0], 376 | pamCoord - tssRow['secondary TSS'][1])) 377 | else: 378 | sgDistanceSeries.append((sgId, tssRow.name[0], tssRow.name[1], 379 | (pamCoord - tssRow['primary TSS'][1]) * -1, 380 | (pamCoord - tssRow['primary TSS'][0]) * -1, 381 | (pamCoord - tssRow['secondary TSS'][1]) * -1, 382 | (pamCoord - tssRow['secondary TSS'][0]) * -1)) 383 | 384 | else: 385 | print name, tssRow 386 | raise ValueError('all gene/trans pairs should be unique') 387 | 388 | return pd.DataFrame(sgDistanceSeries, columns=['sgId', 'gene', 'transcript', 'primary TSS-Up', 'primary TSS-Down', 'secondary TSS-Up', 'secondary TSS-Down']).set_index(keys=['sgId']) 389 | 390 | def generateSgrnaDistanceTable(sgInfoTable, tssTable, libraryTable): 391 | sgDistanceSeries = [] 392 | 393 | for name, group in sgInfoTable['pam coordinate'].groupby([libraryTable['gene'],libraryTable['transcripts']]): 394 | if name in tssTable.index: 395 | tssRow = tssTable.loc[name] 396 | if len(tssRow['cage peak ranges']) != 0: 397 | spotList = [] 398 | for rangeTup in tssRow['cage peak ranges']: 399 | spotList.append((rangeTup[0] - tssRow['position']) * (-1 if tssRow['strand'] == '-' else 1)) 400 | spotList.append((rangeTup[1] - tssRow['position']) * (-1 if tssRow['strand'] == '-' else 1)) 401 | 402 | 
sgDistanceSeries.append(group.apply(lambda row: distanceMetrics(row, tssRow['position'], min(spotList),max(spotList),tssRow['strand']))) 403 | 404 | else: 405 | sgDistanceSeries.append(group.apply(lambda row: distanceMetrics(row, tssRow['position'], 0, 0, tssRow['strand']))) 406 | 407 | return pd.concat(sgDistanceSeries) 408 | 409 | def distanceMetrics(position, annotatedTss, cageUp, cageDown, strand): 410 | relativePos = (position - annotatedTss) * (1 if strand == '+' else -1) 411 | 412 | return pd.Series((relativePos, relativePos-cageUp, relativePos-cageDown), index=('annotated','cageUp','cageDown')) 413 | 414 | def generateSgrnaLengthSeries(libraryTable): 415 | lengthSeries = libraryTable.apply(lambda row: len(row['sequence']),axis=1) 416 | lengthSeries.name = 'length' 417 | return lengthSeries 418 | 419 | def generateRelativeBasesAndStrand(sgInfoTable, tssTable, libraryTable, genomeDict): 420 | relbases = [] 421 | strands = [] 422 | sgIds = [] 423 | for gene, sgInfoGroup in sgInfoTable.groupby(libraryTable['gene']): 424 | tssRowGroup = tssTable.loc[gene] 425 | 426 | if len(set(tssRowGroup['chromosome'].values)) == 1: 427 | chrom = tssRowGroup['chromosome'].values[0] 428 | else: 429 | raise ValueError('multiple annotated chromosomes for ' + gene) 430 | 431 | if len(set(tssRowGroup['strand'].values)) == 1: 432 | strand = tssRowGroup['strand'].values[0] 433 | else: 434 | raise ValueError('multiple annotated strands for ' + gene) 435 | 436 | for sg, sgInfo in sgInfoGroup.iterrows(): 437 | sgIds.append(sg) 438 | geneTup = (sgInfo['gene_name'],','.join(sgInfo['transcript_list'])) 439 | strands.append(True if sgInfo['strand'] == strand else False) 440 | 441 | baseMatrix = [] 442 | for pos in np.arange(-30,10): 443 | baseMatrix.append(getBaseRelativeToPam(chrom, sgInfo['pam coordinate'],sgInfo['length'], sgInfo['strand'], pos, genomeDict)) 444 | relbases.append(baseMatrix) 445 | 446 | relbases = pd.DataFrame(relbases, index = sgIds, columns = 
np.arange(-30,10)).loc[libraryTable.index] 447 | strands = pd.DataFrame(strands, index = sgIds, columns = ['same strand']).loc[libraryTable.index] 448 | 449 | return relbases, strands 450 | 451 | def generateBooleanBaseTable(baseTable): 452 | relbases_bool = [] 453 | for base in ['A','G','C','T']: 454 | relbases_bool.append(baseTable.applymap(lambda val: val == base)) 455 | 456 | return pd.concat(relbases_bool, keys=['A','G','C','T'], axis=1) 457 | 458 | def generateBooleanDoubleBaseTable(baseTable): 459 | doubleBaseTable = [] 460 | tableCols = [] 461 | for b1 in ['A','G','C','T']: 462 | for b2 in ['A','G','C','T']: 463 | for i in np.arange(-30,8): 464 | doubleBaseTable.append(pd.concat((baseTable[i] == b1, baseTable[i+1] == b2),axis=1).all(axis=1)) 465 | tableCols.append(((b1,b2),i)) 466 | return pd.concat(doubleBaseTable, keys=tableCols, axis=1) 467 | 468 | def getBaseRelativeToPam(chrom, pamPos, length, strand, relPos, genomeDict): 469 | rc = {'A':'T','T':'A','G':'C','C':'G','N':'N'} 470 | #print chrom, pamPos, relPos 471 | if strand == '+': 472 | return rc[genomeDict[chrom][pamPos - relPos].upper()] 473 | elif strand == '-': 474 | return genomeDict[chrom][pamPos + relPos].upper() 475 | else: 476 | raise ValueError() 477 | 478 | def getMaxLengthHomopolymer(sequence, base): 479 | sequence = sequence.upper() 480 | base = base.upper() 481 | 482 | maxBaseCount = 0 483 | curBaseCount = 0 484 | for b in sequence: 485 | if b == base: 486 | curBaseCount += 1 487 | else: 488 | maxBaseCount = max((curBaseCount, maxBaseCount)) 489 | curBaseCount = 0 490 | 491 | return max((curBaseCount, maxBaseCount)) 492 | 493 | def getFractionBaseList(sequence, baseList): 494 | baseSet = [base.upper() for base in baseList] 495 | counter = 0.0 496 | for b in sequence.upper(): 497 | if b in baseSet: 498 | counter += 1.0 499 | 500 | return counter / len(sequence) 501 | 502 | #need to fix file naming 503 | def getRNAfoldingTable(libraryTable): 504 | tempfile_fa = 
tempfile.NamedTemporaryFile('w+t', delete=False) 505 | tempfile_rnafold = tempfile.NamedTemporaryFile('w+t', delete=False) 506 | 507 | for name, row in libraryTable.iterrows(): 508 | tempfile_fa.write('>' + name + '\n' + row['sequence'] + '\n') 509 | 510 | tempfile_fa.close() 511 | tempfile_rnafold.close() 512 | # print tempfile_fa.name, tempfile_rnafold.name 513 | 514 | subprocess.call('RNAfold --noPS < %s > %s' % (tempfile_fa.name, tempfile_rnafold.name), shell=True) 515 | 516 | mfeSeries_noScaffold = parseViennaMFE(tempfile_rnafold.name, libraryTable) 517 | isPaired = parseViennaPairing(tempfile_rnafold.name, libraryTable) 518 | 519 | tempfile_fa = tempfile.NamedTemporaryFile('w+t', delete=False) 520 | tempfile_rnafold = tempfile.NamedTemporaryFile('w+t', delete=False) 521 | 522 | with open(tempfile_fa.name,'w') as outfile: 523 | for name, row in libraryTable.iterrows(): 524 | outfile.write('>' + name + '\n' + row['sequence'] + 'GTTTAAGAGCTAAGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT\n') 525 | 526 | tempfile_fa.close() 527 | tempfile_rnafold.close() 528 | # print tempfile_fa.name, tempfile_rnafold.name 529 | 530 | subprocess.call('RNAfold --noPS < %s > %s' % (tempfile_fa.name, tempfile_rnafold.name), shell=True) 531 | 532 | mfeSeries_wScaffold = parseViennaMFE(tempfile_rnafold.name, libraryTable) 533 | 534 | return pd.concat((mfeSeries_noScaffold, mfeSeries_wScaffold, isPaired), keys=('no scaffold', 'with scaffold', 'is Paired'), axis=1) 535 | 536 | def parseViennaMFE(viennaOutputFile, libraryTable): 537 | mfeList = [] 538 | with open(viennaOutputFile) as infile: 539 | for i, line in enumerate(infile): 540 | if i%3 == 2: 541 | mfeList.append(float(line.strip().strip('.() '))) 542 | return pd.Series(mfeList, index=libraryTable.index, name='RNA minimum free energy') 543 | 544 | def parseViennaPairing(viennaOutputFile, libraryTable): 545 | paired = [] 546 | with open(viennaOutputFile) as infile: 547 | for i, line in 
enumerate(infile): 548 | if i%3 == 2: 549 | foldString = line.strip().split(' ')[0] 550 | paired.append([char != '.' for char in foldString[-18:]]) 551 | return pd.DataFrame(paired, index=libraryTable.index, columns = range(-20,-2)) 552 | 553 | def getChromatinDataSeries(bigwigFile, libraryTable, sgInfoTable, tssTable, colname = '', naValue = 0): 554 | bwindex = BigWigFile(open(bigwigFile)) 555 | chromDict = tssTable['chromosome'].to_dict() 556 | 557 | chromatinScores = [] 558 | for name, sgInfo in sgInfoTable.iterrows(): 559 | geneTup = (sgInfo['gene_name'],','.join(sgInfo['transcript_list'])) 560 | 561 | if geneTup not in chromDict: #negative controls 562 | chromatinScores.append(np.nan) 563 | continue 564 | 565 | if sgInfo['strand'] == '+': 566 | sgRange = sgInfo['pam coordinate'] + sgInfo['length'] 567 | else: 568 | sgRange = sgInfo['pam coordinate'] - sgInfo['length'] 569 | 570 | chrom = chromDict[geneTup] 571 | 572 | chromatinArray = bwindex.get_as_array(chrom, min(sgInfo['pam coordinate'], sgRange), max(sgInfo['pam coordinate'], sgRange)) 573 | if chromatinArray is not None and len(chromatinArray) > 0: 574 | chromatinScores.append(np.nanmean(chromatinArray)) 575 | else: #often chrY when using K562 data.. 
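The interval arithmetic used by `getChromatinDataSeries` (and by `getChromatinData` further down) can be sketched in isolation: derive the sgRNA span from the PAM coordinate, protospacer length, and strand, slice the signal window, then take a NaN-aware mean scaled by a normalization factor. This is a minimal numpy-only illustration; the helper name `mean_signal`, the synthetic array, and all coordinate values are invented for the example, with the array standing in for a BigWig chromatin track:

```python
import numpy as np

def mean_signal(window, windowMin, pamCoord, length, strand, normFactor):
    # The protospacer extends to higher coordinates from the PAM on the
    # + strand and to lower coordinates on the - strand.
    end = pamCoord + length if strand == '+' else pamCoord - length
    # Convert genomic coordinates to offsets within the signal window.
    lo = min(pamCoord, end) - windowMin
    hi = max(pamCoord, end) - windowMin
    # NaN-aware mean over the sgRNA interval, scaled by the normalization factor.
    return np.nanmean(window[lo:hi]) / normFactor

# Synthetic 6-bp signal window starting at genomic coordinate 100.
window = np.array([1.0, 2.0, np.nan, 4.0, 4.0, 2.0])
print(mean_signal(window, windowMin=100, pamCoord=103, length=3, strand='-', normFactor=2.0))  # → 0.75
```

Positions with missing coverage (NaN in the BigWig array) are skipped by `np.nanmean` rather than counted as zero, matching how the source code averages chromatin signal over each sgRNA.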
576 | # print name 577 | # print chrom, min(sgInfo['pam coordinate'], sgRange), max(sgInfo['pam coordinate'], sgRange) 578 | chromatinScores.append(np.nan) 579 | 580 | chromatinSeries = pd.Series(chromatinScores, index=libraryTable.index, name = colname) 581 | 582 | return chromatinSeries.fillna(naValue) 583 | 584 | def getChromatinDataSeriesByGene(bigwigFileHandle, libraryTable, sgInfoTable, p1p2Table, sgrnaDistanceTable_p1p2, colname = '', naValue = 0, normWindow = 1000): 585 | bwindex = bigwigFileHandle #BigWigFile(open(bigwigFile)) 586 | 587 | chromatinScores = [] 588 | for (gene, transcript), sgInfoGroup in sgInfoTable.groupby([sgrnaDistanceTable_p1p2['gene'], sgrnaDistanceTable_p1p2['transcript']]): 589 | tssRow = p1p2Table.loc[[(gene, transcript)]].iloc[0,:] 590 | 591 | chrom = tssRow['chromosome'] 592 | 593 | normWindowArray = bwindex.get_as_array(chrom, max(0, tssRow['primary TSS'][0] - normWindow), tssRow['primary TSS'][0] + normWindow) 594 | if normWindowArray is not None: 595 | normFactor = np.nanmax(normWindowArray) 596 | else: 597 | normFactor = 1 598 | 599 | windowMin = max(0, min(sgInfoGroup['pam coordinate']) - max(sgInfoGroup['length']) - 10) 600 | windowMax = max(sgInfoGroup['pam coordinate']) + max(sgInfoGroup['length']) + 10 601 | chromatinWindow = bwindex.get_as_array(chrom, windowMin, windowMax) 602 | 603 | chromatinScores.append(sgInfoGroup.apply(lambda row: getChromatinData(row, chromatinWindow, windowMin, normFactor), axis=1)) 604 | 605 | 606 | chromatinSeries = pd.concat(chromatinScores) 607 | 608 | return chromatinSeries.fillna(naValue) 609 | 610 | def getChromatinData(sgInfoRow, chromatinWindowArray, windowMin, normFactor): 611 | if sgInfoRow['strand'] == '+': 612 | sgRange = sgInfoRow['pam coordinate'] + sgInfoRow['length'] 613 | else: 614 | sgRange = sgInfoRow['pam coordinate'] - sgInfoRow['length'] 615 | 616 | 617 | if chromatinWindowArray is not None:# and len(chromatinWindowArray) > 0: 618 | chromatinArray = 
chromatinWindowArray[min(sgInfoRow['pam coordinate'], sgRange) - windowMin: max(sgInfoRow['pam coordinate'], sgRange) - windowMin] 619 | return np.nanmean(chromatinArray)/normFactor 620 | else: #often chrY when using K562 data.. 621 | # print name 622 | # print chrom, min(sgInfo['pam coordinate'], sgRange), max(sgInfo['pam coordinate'], sgRange) 623 | return np.nan 624 | 625 | def generateTypicalParamTable(libraryTable, sgInfoTable, tssTable, p1p2Table, genomeDict, bwFileHandleDict, transcripts=False): 626 | lengthSeries = generateSgrnaLengthSeries(libraryTable) 627 | 628 | # sgrnaPositionTable = generateSgrnaDistanceTable(sgInfoTable, tssTable, libraryTable) 629 | sgrnaPositionTable_p1p2 = generateSgrnaDistanceTable_p1p2Strategy(sgInfoTable, libraryTable, p1p2Table, transcripts) 630 | 631 | baseTable, strand = generateRelativeBasesAndStrand(sgInfoTable, tssTable, libraryTable, genomeDict) 632 | booleanBaseTable = generateBooleanBaseTable(baseTable) 633 | doubleBaseTable = generateBooleanDoubleBaseTable(baseTable) 634 | 635 | printNow('.') 636 | baseList = ['A','G','C','T'] 637 | homopolymerTable = pd.concat([libraryTable.apply(lambda row: np.floor(getMaxLengthHomopolymer(row['sequence'], base)), axis=1) for base in baseList],keys=baseList,axis=1) 638 | 639 | baseFractions = pd.concat([libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['A']),axis=1), 640 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['G']),axis=1), 641 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['C']),axis=1), 642 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['T']),axis=1), 643 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['G','C']),axis=1), 644 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['G','A']),axis=1), 645 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['C','A']),axis=1)],keys=['A','G','C','T','GC','purine','CA'],axis=1) 646 | 
647 | printNow('.') 648 | 649 | dnaseSeries = getChromatinDataSeriesByGene(bwFileHandleDict['dnase'], libraryTable, sgInfoTable, p1p2Table, sgrnaPositionTable_p1p2) 650 | printNow('.') 651 | faireSeries = getChromatinDataSeriesByGene(bwFileHandleDict['faire'], libraryTable, sgInfoTable, p1p2Table, sgrnaPositionTable_p1p2) 652 | printNow('.') 653 | mnaseSeries = getChromatinDataSeriesByGene(bwFileHandleDict['mnase'], libraryTable, sgInfoTable, p1p2Table, sgrnaPositionTable_p1p2) 654 | printNow('.') 655 | 656 | rnafolding = getRNAfoldingTable(libraryTable) 657 | 658 | printNow('Done!') 659 | 660 | return pd.concat([lengthSeries, 661 | sgrnaPositionTable_p1p2.iloc[:,2:], 662 | homopolymerTable, 663 | baseFractions, 664 | strand, 665 | booleanBaseTable['A'], 666 | booleanBaseTable['T'], 667 | booleanBaseTable['G'], 668 | booleanBaseTable['C'], 669 | doubleBaseTable, 670 | pd.concat([dnaseSeries,faireSeries,mnaseSeries],keys=['DNase','FAIRE','MNase'], axis=1), 671 | rnafolding['no scaffold'], 672 | rnafolding['with scaffold'], 673 | rnafolding['is Paired']],keys=['length', 674 | 'distance', 675 | 'homopolymers', 676 | 'base fractions', 677 | 'strand', 678 | 'base table-A', 679 | 'base table-T', 680 | 'base table-G', 681 | 'base table-C', 682 | 'base dimers', 683 | 'accessibility', 684 | 'RNA folding-no scaffold', 685 | 'RNA folding-with scaffold', 686 | 'RNA folding-pairing, no scaffold'],axis=1) 687 | 688 | # def generateTypicalParamTable_parallel(libraryTable, sgInfoTable, tssTable, p1p2Table, genomeDict, bwFileHandleDict, processors): 689 | # processPool = multiprocessing.Pool(processors) 690 | 691 | # colTupList = zip([group for gene, group in libraryTable.groupby(libraryTable['gene'])], 692 | # [group for gene, group in sgInfoTable.groupby(libraryTable['gene'])]) 693 | 694 | # result = processPool.map(lambda colTup: generateTypicalParamTable(colTup[0], colTup[1], tssTable, p1p2Table, genomeDict,bwFileHandleDict), colTupList) 695 | 696 | # return pd.concat(result) 
697 | 698 | ############################################################################### 699 | # Learn Parameter Weights # 700 | ############################################################################### 701 | def fitParams(paramTable, scoreTable, fitTable): 702 | predictedParams = [] 703 | estimators = [] 704 | 705 | for i, (name, col) in enumerate(paramTable.iteritems()): 706 | 707 | fitRow = fitTable.iloc[i] 708 | 709 | if fitRow['type'] == 'binary': #binary parameter 710 | # print name, 'is binary parameter' 711 | predictedParams.append(col) 712 | estimators.append('binary') 713 | 714 | elif fitRow['type'] == 'continuous': 715 | col_reshape = col.values.reshape(len(col),1) 716 | parameters = fitRow['params'] 717 | 718 | svr = svm.SVR(cache_size=500) 719 | clf = grid_search.GridSearchCV(svr, parameters, n_jobs=16, verbose=0) 720 | clf.fit(col_reshape, scoreTable) 721 | 722 | print name, clf.best_params_ 723 | predictedParams.append(pd.Series(clf.predict(col_reshape), index=col.index, name=name)) 724 | estimators.append(clf.best_estimator_) 725 | 726 | elif fitRow['type'] == 'binnable': 727 | parameters = fitRow['params'] 728 | 729 | assignedBins = binValues(col, parameters['bin width'], parameters['min edge data']) 730 | groupStats = scoreTable.groupby(assignedBins).agg(parameters['bin function']) 731 | 732 | # print name 733 | # print pd.concat((groupStats,scoreTable.groupby(assignedBins).size()), axis=1) 734 | 735 | binnedScores = assignedBins.apply(lambda binVal: groupStats.loc[binVal]) 736 | 737 | predictedParams.append(binnedScores) 738 | estimators.append(groupStats) 739 | 740 | elif fitRow['type'] == 'binnable_onehot': 741 | parameters = fitRow['params'] 742 | 743 | assignedBins = binValues(col, parameters['bin width'], parameters['min edge data']) 744 | binGroups = scoreTable.groupby(assignedBins) 745 | groupStats = binGroups.agg(parameters['bin function']) 746 | 747 | # print name 748 | # print 
pd.concat((groupStats,scoreTable.groupby(assignedBins).size()), axis=1) 749 | 750 | oneHotFrame = pd.DataFrame(np.zeros((len(assignedBins),len(binGroups))), index = assignedBins.index, \ 751 | columns=pd.MultiIndex.from_tuples([(name[0],', '.join([name[1],key])) for key in sorted(binGroups.groups.keys())])) 752 | 753 | for groupName, group in binGroups: 754 | oneHotFrame.loc[group.index, (name[0],', '.join([name[1],groupName]))] = 1 755 | 756 | predictedParams.append(oneHotFrame) 757 | estimators.append(groupStats) 758 | 759 | else: 760 | raise ValueError(fitRow['type'] + ' not implemented') 761 | 762 | return pd.concat(predictedParams, axis=1), estimators 763 | 764 | def binValues(col, binsize, minEdgePoints=0, edgeOffset = None): 765 | bins = np.floor(col / binsize) * binsize 766 | 767 | if minEdgePoints <= 0: 768 | if edgeOffset is None: 769 | return bins.apply(lambda binVal: str(binVal)) 770 | else: 771 | return bins 772 | elif minEdgePoints >= len(col): 773 | raise ValueError('too few data points to meet minimum edge requirements') 774 | else: 775 | binGroups = bins.groupby(bins) 776 | binCounts = binGroups.agg(len).sort_index() 777 | 778 | i = 0 779 | leftBin = [] 780 | if binCounts.iloc[i] < minEdgePoints: 781 | leftCount = 0 782 | while leftCount < minEdgePoints: 783 | leftCount += binCounts.iloc[i] 784 | leftBin.append(binCounts.index[i]) 785 | i += 1 786 | 787 | leftLessThan = binCounts.index[i] 788 | 789 | j = -1 790 | rightBin = [] 791 | if binCounts.iloc[j] < minEdgePoints: 792 | rightCount = 0 793 | while rightCount < minEdgePoints: 794 | rightBin.append(binCounts.index[j]) 795 | rightCount += binCounts.iloc[j] 796 | j -= 1 797 | 798 | rightMoreThan = binCounts.index[j + 1] 799 | 800 | if i > len(binCounts) + j: 801 | raise ValueError('min edge requirements cannot be met') 802 | 803 | if edgeOffset is None: #return strings for bins, fine for grouping, problems for plotting 804 | return bins.apply(lambda binVal: '< %f' % leftLessThan if binVal in
leftBin else('>= %f' % rightMoreThan if binVal in rightBin else str(binVal))) 805 | else: #apply arbitrary offset instead to ease plotting 806 | return bins.apply(lambda binVal: leftLessThan - edgeOffset if binVal in leftBin else(rightMoreThan + edgeOffset if binVal in rightBin else binVal)) 807 | 808 | def transformParams(paramTable, fitTable, estimators): 809 | transformedParams = [] 810 | 811 | for i, (name, col) in enumerate(paramTable.iteritems()): 812 | fitRow = fitTable.iloc[i] 813 | 814 | if fitRow['type'] == 'binary': 815 | transformedParams.append(col) 816 | elif fitRow['type'] == 'continuous': 817 | col_reshape = col.values.reshape(len(col),1) 818 | transformedParams.append(pd.Series(estimators[i].predict(col_reshape), index=col.index, name=name)) 819 | elif fitRow['type'] == 'binnable': 820 | binStats = estimators[i] 821 | assignedBins = applyBins(col, binStats.index.values) 822 | transformedParams.append(assignedBins.apply(lambda binVal: binStats.loc[binVal])) 823 | 824 | elif fitRow['type'] == 'binnable_onehot': 825 | binStats = estimators[i] 826 | 827 | assignedBins = applyBins(col, binStats.index.values) 828 | binGroups = col.groupby(assignedBins) 829 | 830 | # print name 831 | # print pd.concat((groupStats,scoreTable.groupby(assignedBins).size()), axis=1) 832 | 833 | oneHotFrame = pd.DataFrame(np.zeros((len(assignedBins),len(binGroups))), index = assignedBins.index, \ 834 | columns=pd.MultiIndex.from_tuples([(name[0],', '.join([name[1],key])) for key in sorted(binGroups.groups.keys())])) 835 | 836 | for groupName, group in binGroups: 837 | oneHotFrame.loc[group.index, (name[0],', '.join([name[1],groupName]))] = 1 838 | 839 | transformedParams.append(oneHotFrame) 840 | 841 | return pd.concat(transformedParams, axis=1) 842 | 843 | def applyBins(column, binStrings): 844 | leftLabel = '' 845 | rightLabel = '' 846 | binTups = [] 847 | for binVal in binStrings: 848 | if binVal[0] == '<': 849 | leftLabel = binVal 850 | elif binVal[0] == '>': 851 | 
rightLabel = binVal 852 | rightBound = float(binVal[3:]) 853 | else: 854 | binTups.append((float(binVal),binVal)) 855 | 856 | binTups.sort() 857 | # print binTups 858 | leftBound = binTups[0][0] 859 | if leftLabel == '': 860 | leftLabel = binTups[0][1] 861 | 862 | if rightLabel == '': 863 | rightLabel = binTups[-1][1] 864 | rightBound = binTups[-1][0] 865 | 866 | def binFunc(val): 867 | return leftLabel if val < leftBound else (rightLabel if val >= rightBound else [tup[1] for tup in binTups if val >= tup[0]][-1]) 868 | 869 | return column.apply(binFunc) 870 | 871 | ############################################################################### 872 | # Predict sgRNA Scores and Library?? # 873 | ############################################################################### 874 | def findAllGuides(p1p2Table, genomeDict, rangeTup, sgRNALength=20): 875 | newLibraryTable = [] 876 | newSgInfoTable = [] 877 | 878 | for tssTup, tssRow in p1p2Table.iterrows(): 879 | rangeStart = min(min(tssRow['primary TSS']), min(tssRow['secondary TSS'])) + (rangeTup[0] if tssRow['strand'] == '+' else -1 * rangeTup[1]) 880 | rangeEnd = max(max(tssRow['primary TSS']), max(tssRow['secondary TSS'])) + (rangeTup[1] if tssRow['strand'] == '+' else -1 * rangeTup[0]) 881 | 882 | genomeRange = str(genomeDict[tssRow['chromosome']][rangeStart:rangeEnd + 1].seq) 883 | 884 | rangeLength = rangeEnd + 1 - rangeStart 885 | for posOffset in range(rangeLength): 886 | if genomeRange[posOffset:posOffset+1+1].upper() == 'CC' \ 887 | and posOffset + 3 + sgRNALength < rangeLength \ 888 | and 'N' not in genomeRange[posOffset:posOffset+3+sgRNALength].upper(): 889 | pamCoord = rangeStart+posOffset 890 | sgId = tssTup[0] + '_' + '+' + '_' + str(pamCoord) + '.' 
+ str(sgRNALength + 3) + '-' + tssTup[1] 891 | gene = tssTup[0] 892 | transcripts = tssTup[1] 893 | rawSequence = genomeRange[posOffset:posOffset+3+sgRNALength] 894 | sequence = 'G' + str(Seq.Seq(rawSequence[3:-1]).reverse_complement()) 895 | 896 | newLibraryTable.append((sgId, gene, transcripts, sequence, rawSequence)) 897 | newSgInfoTable.append((sgId, 'None', gene, sgRNALength + 3, pamCoord, 'not assigned', pamCoord, '+', tssTup[1].split(','))) 898 | 899 | elif genomeRange[posOffset-1:posOffset+1].upper() == 'GG' \ 900 | and posOffset - 3 - sgRNALength >= 0 \ 901 | and 'N' not in genomeRange[posOffset + 1 - 3 - sgRNALength:posOffset+1].upper(): 902 | pamCoord = rangeStart+posOffset 903 | sgId = tssTup[0] + '_' + '-' + '_' + str(pamCoord) + '.' + str(sgRNALength + 3) + '-' + tssTup[1] 904 | gene = tssTup[0] 905 | transcripts = tssTup[1] 906 | rawSequence = genomeRange[posOffset + 1 - 3 - sgRNALength:posOffset+1] 907 | sequence = 'G' + rawSequence[1:-3] 908 | 909 | newLibraryTable.append((sgId, gene, transcripts, sequence, rawSequence)) 910 | newSgInfoTable.append((sgId, 'None', gene, sgRNALength + 3, pamCoord, 'not assigned', pamCoord, '-', tssTup[1].split(','))) 911 | 912 | return pd.DataFrame(newLibraryTable,columns=['sgId','gene','transcripts','sequence', 'genomic sequence']).set_index('sgId'), \ 913 | pd.DataFrame(newSgInfoTable, columns=['sgId','Sublibrary','gene_name', 'length', 'pam coordinate','pass_score','position','strand', 'transcript_list']).set_index('sgId') 914 | 915 | ############################################################################### 916 | # Utility Functions # 917 | ############################################################################### 918 | def getPseudoIndices(table): 919 | return table.apply(lambda row: row.name[0][:6] == 'pseudo', axis=1) 920 | 921 | def loadGencodeData(gencodeGTF, indexByENSG = True): 922 | printNow('Loading annotation file...') 923 | gencodeData = dict() 924 | with open(gencodeGTF) as gencodeFile: 925 
| for line in gencodeFile: 926 | if line[0] != '#': 927 | linesplit = line.strip().split('\t') 928 | attrsplit = linesplit[-1].strip('; ').split('; ') 929 | attrdict = {attr.split(' ')[0]:attr.split(' ')[1].strip('\"') for attr in attrsplit if attr[:3] !='tag'} 930 | attrdict['tags'] = [attr.split(' ')[1].strip('\"') for attr in attrsplit if attr[:3] == 'tag'] 931 | 932 | if indexByENSG: 933 | dictKey = attrdict['gene_id'].split('.')[0] 934 | else: 935 | dictKey = attrdict['gene_name'] 936 | 937 | #catch y-linked pseudoautosomal genes 938 | if 'PAR' in attrdict['tags'] and linesplit[0] == 'chrY': 939 | continue 940 | 941 | if linesplit[2] == 'gene':# and attrdict['gene_type'] == 'protein_coding': 942 | gencodeData[dictKey] = ([linesplit[0],long(linesplit[3]),long(linesplit[4]),linesplit[6], attrdict],[]) 943 | elif linesplit[2] == 'transcript': 944 | gencodeData[dictKey][1].append([linesplit[0],long(linesplit[3]),long(linesplit[4]),linesplit[6], attrdict]) 945 | 946 | printNow('Done\n') 947 | 948 | return gencodeData 949 | 950 | def loadGenomeAsDict(genomeFasta): 951 | printNow('Loading genome file...') 952 | genomeDict = SeqIO.to_dict(SeqIO.parse(genomeFasta,'fasta')) 953 | printNow('Done\n') 954 | return genomeDict 955 | 956 | def loadCageBedData(cageBedFile, matchList = ['p1','p2']): 957 | cageBedDict = {match:dict() for match in matchList} 958 | 959 | with open(cageBedFile) as infile: 960 | for line in infile: 961 | linesplit = line.strip().split('\t') 962 | 963 | for name in linesplit[3].split(','): 964 | namesplit = name.split('@') 965 | if len(namesplit) == 2: 966 | for match in matchList: 967 | if namesplit[0] == match: 968 | cageBedDict[match][namesplit[1]] = linesplit 969 | 970 | return cageBedDict 971 | 972 | def matchPeakName(peakName, geneAliasList, promoterRank): 973 | for peakString in peakName.split(','): 974 | peakSplit = peakString.split('@') 975 | 976 | if len(peakSplit) == 2\ 977 | and peakSplit[0] == promoterRank\ 978 | and peakSplit[1] in 
geneAliasList: 979 | return True 980 | 981 | if len(peakSplit) > 2: 982 | print peakName 983 | 984 | return False 985 | 986 | def generateAliasDict(hgncFile, gencodeData): 987 | hgncTable = pd.read_csv(hgncFile,sep='\t', header=0).fillna('') 988 | 989 | geneToAliases = dict() 990 | geneToENSG = dict() 991 | 992 | for i, row in hgncTable.iterrows(): 993 | geneToAliases[row['Approved Symbol']] = [row['Approved Symbol']] 994 | geneToAliases[row['Approved Symbol']].extend([] if len(row['Previous Symbols']) == 0 else [name.strip() for name in row['Previous Symbols'].split(',')]) 995 | geneToAliases[row['Approved Symbol']].extend([] if len(row['Synonyms']) == 0 else [name.strip() for name in row['Synonyms'].split(',')]) 996 | 997 | geneToENSG[row['Approved Symbol']] = row['Ensembl Gene ID'] 998 | 999 | # for gene in gencodeData: 1000 | # if gene not in geneToAliases: 1001 | # geneToAliases[gene] = [gene] 1002 | 1003 | # geneToAliases[gene].extend([tr[-1]['transcript_id'].split('.')[0] for tr in gencodeData[gene][1]]) 1004 | 1005 | return geneToAliases, geneToENSG 1006 | 1007 | #Parse information from the sgRNA ID standard format 1008 | def parseSgId(sgId): 1009 | parseDict = dict() 1010 | 1011 | #sublibrary 1012 | if len(sgId.split('=')) == 2: 1013 | parseDict['Sublibrary'] = sgId.split('=')[0] 1014 | remainingId = sgId.split('=')[1] 1015 | else: 1016 | parseDict['Sublibrary'] = None 1017 | remainingId = sgId 1018 | 1019 | #gene name and strand 1020 | underscoreSplit = remainingId.split('_') 1021 | 1022 | for i,item in enumerate(underscoreSplit): 1023 | if item == '+': 1024 | strand = '+' 1025 | geneName = '_'.join(underscoreSplit[:i]) 1026 | remainingId = '_'.join(underscoreSplit[i+1:]) 1027 | break 1028 | elif item == '-': 1029 | strand = '-' 1030 | geneName = '_'.join(underscoreSplit[:i]) 1031 | remainingId = '_'.join(underscoreSplit[i+1:]) 1032 | break 1033 | else: 1034 | continue 1035 | 1036 | parseDict['strand'] = strand 1037 | parseDict['gene_name'] = geneName 
1038 | 1039 | #position 1040 | dotSplit = remainingId.split('.') 1041 | parseDict['position'] = int(dotSplit[0]) 1042 | remainingId = '.'.join(dotSplit[1:]) 1043 | 1044 | #length incl pam 1045 | dashSplit = remainingId.split('-') 1046 | parseDict['length'] = int(dashSplit[0]) 1047 | remainingId = '-'.join(dashSplit[1:]) 1048 | 1049 | #pass score 1050 | tildaSplit = remainingId.split('~') 1051 | parseDict['pass_score'] = tildaSplit[-1] 1052 | remainingId = '~'.join(tildaSplit[:-1]) #should always be length 1 anyway 1053 | 1054 | #transcripts 1055 | parseDict['transcript_list'] = remainingId.split(',') 1056 | 1057 | return parseDict 1058 | 1059 | def parseAllSgIds(libraryTable): 1060 | sgInfoList = [] 1061 | for sgId, row in libraryTable.iterrows(): 1062 | sgInfo = parseSgId(sgId) 1063 | 1064 | #the position encoded in the sgId is the PAM coordinate on both strands (no offset applied) 1065 | sgInfo['pam coordinate'] = sgInfo['position'] 1066 | 1067 | sgInfoList.append(sgInfo) 1068 | 1069 | return pd.DataFrame(sgInfoList, index=libraryTable.index) 1070 | 1071 | def printNow(outputString): 1072 | sys.stdout.write(outputString) 1073 | sys.stdout.flush() 1074 | -------------------------------------------------------------------------------- /Library_design_walkthrough.md: -------------------------------------------------------------------------------- 1 | 2 | 1. Learning sgRNA predictors from empirical data 3 | * Load scripts and empirical data 4 | * Calculate parameters for empirical sgRNAs 5 | * Fit parameters 6 | 2. Applying machine learning model to predict sgRNA activity 7 | * Generate TSS annotation using FANTOM dataset 8 | * Find all sgRNAs in genomic regions of interest 9 | * Predicting sgRNA activity 10 | 3.
Construct sgRNA libraries 11 | * Score sgRNAs for off-target potential 12 | * Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering 13 | * Design negative controls matching the base composition of the library 14 | * Finalizing library design 15 | 16 | # 1. Learning sgRNA predictors from empirical data 17 | ## Load scripts and empirical data 18 | 19 | 20 | ```python 21 | import sys 22 | sys.path.insert(0, '../ScreenProcessing/') 23 | %run sgRNA_learning.py 24 | ``` 25 | 26 | 27 | ```python 28 | genomeDict = loadGenomeAsDict('large_data_files/hg19.fa') 29 | ``` 30 | 31 | Loading genome file...Done 32 | 33 | 34 | 35 | ```python 36 | #to use pre-calculated sgRNA activity score data (e.g. provided CRISPRi training data), load the following: 37 | #CRISPRa activity score data also included in data_files 38 | libraryTable_training = pd.read_csv('data_files/CRISPRi_trainingdata_libraryTable.txt', sep='\t', index_col = 0) 39 | libraryTable_training.head() 40 | ``` 41 | 42 | 43 | 44 | 45 |
46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 |
sublibrarygenetranscriptssequence
sgId
Drug_Targets+Kinase_Phosphatase=AARS_+_70323216.24-all~e39m1drug_targets+kinase_phosphataseAARSallGCCCCAGGATCAGGCCCCGCG
Drug_Targets+Kinase_Phosphatase=AARS_+_70323296.24-all~e39m1drug_targets+kinase_phosphataseAARSallGGCCGCCCTCGGAGAGCTCTG
Drug_Targets+Kinase_Phosphatase=AARS_+_70323318.24-all~e39m1drug_targets+kinase_phosphataseAARSallGACGGCGACCCTAGGAGAGGT
Drug_Targets+Kinase_Phosphatase=AARS_+_70323362.24-all~e39m1drug_targets+kinase_phosphataseAARSallGGTGCAGCGGGCCCTTGGCGG
Drug_Targets+Kinase_Phosphatase=AARS_+_70323441.24-all~e39m1drug_targets+kinase_phosphataseAARSallGCGCTCTGATTGGACGGAGCG
101 |
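The long index strings above follow the library's sgId convention — `[sublibrary=]gene_strand_pamPosition.lengthWithPAM-transcripts~passScore` — which `parseSgId` in sgRNA_learning.py decodes in full. A minimal sketch of the same convention (`parse_sgid_sketch` is an illustrative name, not a function in the module; it assumes the strand token is the second-to-last underscore-separated field, which holds for the IDs shown here):

```python
def parse_sgid_sketch(sgid):
    """Decode an sgId like the index strings above (simplified sketch)."""
    sublibrary, _, rest = sgid.rpartition('=')          # optional 'sublibrary=' prefix
    gene, strand, rest = rest.rsplit('_', 2)            # strand token is '+' or '-'
    position, rest = rest.split('.', 1)                 # PAM coordinate
    length, rest = rest.split('-', 1)                   # length including PAM
    transcripts, _, pass_score = rest.rpartition('~')   # e.g. 'all' ~ 'e39m1'
    return {'sublibrary': sublibrary or None, 'gene': gene, 'strand': strand,
            'position': int(position), 'length': int(length),
            'transcript_list': transcripts.split(','), 'pass_score': pass_score}

info = parse_sgid_sketch('Drug_Targets+Kinase_Phosphatase=AARS_+_70323216.24-all~e39m1')
print(info['gene'])      # AARS
print(info['position'])  # 70323216
```

Gene names containing a literal `_+_` or `_-_` need the more careful token scan in `parseSgId`.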
102 | 103 | 104 | 105 | 106 | ```python 107 | sgInfoTable_training = pd.read_csv('data_files/CRISPRi_trainingdata_sgRNAInfoTable.txt', sep='\t', index_col=0) 108 | sgInfoTable_training.head() 109 | ``` 110 | 111 | 112 | 113 | 114 |
115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 |
Sublibrarygene_namelengthpam coordinatepass_scorepositionstrandtranscript_list
sgId
Drug_Targets+Kinase_Phosphatase=AARS_+_70323216.24-all~e39m1Drug_Targets+Kinase_PhosphataseAARS2470323216e39m170323216+['all']
Drug_Targets+Kinase_Phosphatase=AARS_+_70323296.24-all~e39m1Drug_Targets+Kinase_PhosphataseAARS2470323296e39m170323296+['all']
Drug_Targets+Kinase_Phosphatase=AARS_+_70323318.24-all~e39m1Drug_Targets+Kinase_PhosphataseAARS2470323318e39m170323318+['all']
Drug_Targets+Kinase_Phosphatase=AARS_+_70323362.24-all~e39m1Drug_Targets+Kinase_PhosphataseAARS2470323362e39m170323362+['all']
Drug_Targets+Kinase_Phosphatase=AARS_+_70323441.24-all~e39m1Drug_Targets+Kinase_PhosphataseAARS2470323441e39m170323441+['all']
198 |
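Each row above records the PAM coordinate, the guide length (PAM included), and the strand; the chromatin-scoring functions in sgRNA_learning.py turn these into a genomic window by extending from the PAM across the protospacer — rightward on '+', leftward on '-'. A sketch of that arithmetic (`protospacer_window` is an illustrative name, not a function in the module):

```python
def protospacer_window(pam_coordinate, length, strand):
    # extend from the PAM across the guide: '+' strand grows right, '-' strand grows left
    other_end = pam_coordinate + length if strand == '+' else pam_coordinate - length
    return min(pam_coordinate, other_end), max(pam_coordinate, other_end)

print(protospacer_window(70323216, 24, '+'))  # (70323216, 70323240)
print(protospacer_window(70323216, 24, '-'))  # (70323192, 70323216)
```

This is the same min/max range that `getChromatinDataSeries` passes to the bigwig index when averaging accessibility signal over each guide.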
199 | 200 | 201 | 202 | 203 | ```python 204 | activityScores = pd.read_csv('data_files/CRISPRi_trainingdata_activityScores.txt',sep='\t',index_col=0, header=None).iloc[:,0] 205 | activityScores.head() 206 | ``` 207 | 208 | 209 | 210 | 211 | 0 212 | Drug_Targets+Kinase_Phosphatase=AARS_+_70323216.24-all~e39m1 0.348892 213 | Drug_Targets+Kinase_Phosphatase=AARS_+_70323296.24-all~e39m1 0.912409 214 | Drug_Targets+Kinase_Phosphatase=AARS_+_70323318.24-all~e39m1 0.997242 215 | Drug_Targets+Kinase_Phosphatase=AARS_+_70323362.24-all~e39m1 0.962154 216 | Drug_Targets+Kinase_Phosphatase=AARS_+_70323441.24-all~e39m1 0.019320 217 | Name: 1, dtype: float64 218 | 219 | 220 | 221 | 222 | ```python 223 | tssTable = pd.read_csv('data_files/human_tssTable.txt',sep='\t', index_col=range(2)) 224 | tssTable.head() 225 | ``` 226 | 227 | 228 | 229 | 230 |
231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | 288 | 289 | 290 | 291 |
positionstrandchromosomecage peak ranges
genetranscripts
A1BGall58864864-chr19[(58864822, 58864847), (58864848, 58864868)]
A1CFENST00000373993.152619744-chr10[]
ENST00000374001.2,ENST00000373997.3,ENST00000282641.252645434-chr10[(52645379, 52645393), (52645416, 52645444)]
A2Mall9268752-chr12[(9268547, 9268556), (9268559, 9268568), (9268...
A2ML1all8975067+chr12[(8975061, 8975072), (8975101, 8975108), (8975...
292 |
293 | 294 | 295 | 296 | 297 | ```python 298 | p1p2Table = pd.read_csv('data_files/human_p1p2Table.txt',sep='\t', header=0, index_col=range(2)) 299 | p1p2Table['primary TSS'] = p1p2Table['primary TSS'].apply(lambda tupString: (int(tupString.strip('()').split(', ')[0].split('.')[0]), int(tupString.strip('()').split(', ')[1].split('.')[0]))) 300 | p1p2Table['secondary TSS'] = p1p2Table['secondary TSS'].apply(lambda tupString: (int(tupString.strip('()').split(', ')[0].split('.')[0]),int(tupString.strip('()').split(', ')[1].split('.')[0]))) 301 | p1p2Table.head() 302 | ``` 303 | 304 | 305 | 306 | 307 |
308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | 374 | 375 |
chromosomestrandTSS sourceprimary TSSsecondary TSS
genetranscript
A1BGP1chr19-CAGE, matched peaks(58858938, 58859039)(58858938, 58859039)
P2chr19-CAGE, matched peaks(58864822, 58864847)(58864822, 58864847)
A1CFP1P2chr10-CAGE, matched peaks(52645379, 52645393)(52645379, 52645393)
A2MP1P2chr12-CAGE, matched peaks(9268507, 9268523)(9268528, 9268542)
A2ML1P1P2chr12+CAGE, matched peaks(8975206, 8975223)(8975144, 8975169)
376 |
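The `p1p2Table` cell above rebuilds the saved TSS tuples by hand with strip/split; for well-formed `'(start, end)'` strings, `ast.literal_eval` from the standard library does the same job in one step (a sketch — it assumes every entry really is a two-element numeric tuple string):

```python
import ast

def parse_tss_tuple(tup_string):
    # '(58864822.0, 58864847.0)' -> (58864822, 58864847)
    start, end = ast.literal_eval(tup_string)
    return int(start), int(end)

print(parse_tss_tuple('(58864822.0, 58864847.0)'))  # (58864822, 58864847)
```

With pandas this is simply `p1p2Table['primary TSS'].apply(parse_tss_tuple)`.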
377 | 378 | 379 | 380 | ## Calculate parameters for empirical sgRNAs 381 | 382 | ### Because scikit-learn currently does not support any robust method for saving and re-loading the machine learning model, the best strategy is to simply re-learn the model from the training data 383 | 384 | 385 | ```python 386 | #Load bigwig files for any chromatin data of interest 387 | bwhandleDict = {'dnase':BigWigFile(open('large_data_files/wgEncodeOpenChromDnaseK562BaseOverlapSignalV2.bigWig')), 388 | 'faire':BigWigFile(open('large_data_files/wgEncodeOpenChromFaireK562Sig.bigWig')), 389 | 'mnase':BigWigFile(open('large_data_files/wgEncodeSydhNsomeK562Sig.bigWig'))} 390 | ``` 391 | 392 | 393 | ```python 394 | paramTable_trainingGuides = generateTypicalParamTable(libraryTable_training,sgInfoTable_training, tssTable, p1p2Table, genomeDict, bwhandleDict) 395 | ``` 396 | 397 | .... 398 | 399 | /usr/local/lib/python2.7/dist-packages/numpy/lib/nanfunctions.py:675: RuntimeWarning: Mean of empty slice 400 | warnings.warn("Mean of empty slice", RuntimeWarning) 401 | /usr/local/lib/python2.7/dist-packages/numpy/lib/nanfunctions.py:326: RuntimeWarning: All-NaN slice encountered 402 | warnings.warn("All-NaN slice encountered", RuntimeWarning) 403 | 404 | 405 | .Done! 
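Among the features folded into the parameter table above are the longest homopolymer run per base (`getMaxLengthHomopolymer`) and base-composition fractions (`getFractionBaseList`). Their logic can be sketched in plain Python (sketch function names differ from the module's helpers; `itertools.groupby` collapses the sequence into runs):

```python
from itertools import groupby

def max_homopolymer(sequence, base):
    # length of the longest run of `base` in the sequence
    runs = [len(list(group)) for char, group in groupby(sequence.upper()) if char == base]
    return max(runs) if runs else 0

def base_fraction(sequence, bases):
    # fraction of positions matching any base in `bases` (e.g. GC content)
    sequence = sequence.upper()
    return sum(sequence.count(base) for base in bases) / float(len(sequence))

seq = 'GCCCCAGGATCAGGCCCCGCG'  # first training guide shown earlier
print(max_homopolymer(seq, 'C'))           # 4
print(round(base_fraction(seq, 'GC'), 2))  # 0.81
```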
406 | 407 | ## Fit parameters 408 | 409 | 410 | ```python 411 | #load in the 5-fold cross-validation splits used to generate the model 412 | import cPickle 413 | with open('data_files/CRISPRi_trainingdata_traintestsets.txt') as infile: 414 | geneFold_train, geneFold_test, fitTable = cPickle.load(infile) 415 | ``` 416 | 417 | 418 | ```python 419 | transformedParams_train, estimators = fitParams(paramTable_trainingGuides.loc[activityScores.dropna().index].iloc[geneFold_train], activityScores.loc[activityScores.dropna().index].iloc[geneFold_train], fitTable) 420 | 421 | transformedParams_test = transformParams(paramTable_trainingGuides.loc[activityScores.dropna().index].iloc[geneFold_test], fitTable, estimators) 422 | 423 | reg = linear_model.ElasticNetCV(l1_ratio=[.5, .75, .9, .99,1], n_jobs=16, max_iter=2000) 424 | 425 | scaler = preprocessing.StandardScaler() 426 | reg.fit(scaler.fit_transform(transformedParams_train), activityScores.loc[activityScores.dropna().index].iloc[geneFold_train]) 427 | predictedScores = pd.Series(reg.predict(scaler.transform(transformedParams_test)), index=transformedParams_test.index) 428 | testScores = activityScores.loc[activityScores.dropna().index].iloc[geneFold_test] 429 | 430 | print 'Prediction AUC-ROC:', metrics.roc_auc_score((testScores >= .75).values, np.array(predictedScores.values,dtype='float64')) 431 | print 'Prediction R^2:', reg.score(scaler.transform(transformedParams_test), testScores) 432 | print 'Regression parameters:', reg.l1_ratio_, reg.alpha_ 433 | coefs = pd.DataFrame(zip(*[abs(reg.coef_),reg.coef_]), index = transformedParams_test.columns, columns=['abs','true']) 434 | print 'Number of features used:', len(coefs) - sum(coefs['abs'] < .00000000001) 435 | ``` 436 | 437 | ('distance', 'primary TSS-Up') {'C': 0.05, 'gamma': 0.0001} 438 | ('distance', 'primary TSS-Down') {'C': 0.5, 'gamma': 5e-05} 439 | ('distance', 'secondary TSS-Up') {'C': 0.1, 'gamma': 5e-05} 440 | ('distance', 'secondary TSS-Down') {'C': 0.1, 
'gamma': 5e-05} 441 | 442 | 443 | /home/mhorlbeck/.local/lib/python2.7/site-packages/sklearn/utils/validation.py:323: UserWarning: StandardScaler assumes floating point values as input, got object 444 | "got %s" % (estimator, X.dtype)) 445 | /home/mhorlbeck/.local/lib/python2.7/site-packages/sklearn/linear_model/base.py:400: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison 446 | if precompute == 'auto': 447 | 448 | 449 | Prediction AUC-ROC: 0.803109696478 450 | Prediction R^2: 0.31263687609 451 | Regression parameters: 0.5 0.00534455043278 452 | Number of features used: 327 453 | 454 | 455 | 456 | ```python 457 | #can save state for reproducing estimators later 458 | #the pickling of the scikit-learn estimators/regressors will allow the model to be reloaded for prediction of other guide designs, 459 | # but will not be compatible across scikit-learn versions, so it is important to preserve the training data and training/test folds 460 | import cPickle 461 | estimatorString = cPickle.dumps((fitTable, estimators, scaler, reg)) 462 | with open(PICKLE_FILE,'w') as outfile: 463 | outfile.write(estimatorString) 464 | 465 | #also save the transformed parameters as these can slightly differ based on the automated binning strategy 466 | transformedParams_train.head().to_csv(TRANSFORMED_PARAM_HEADER,sep='\t') 467 | ``` 468 | 469 | # 2. Applying machine learning model to predict sgRNA activity 470 | 471 | ## Generate TSS annotation using FANTOM dataset 472 | 473 | 474 | ```python 475 | #you can supply any table of gene transcription start sites formatted as below 476 | #for demonstration purposes, the rest of this walkthrough will use a small arbitrary subset of the protein coding TSS table 477 | tssTable_new = tssTable.iloc[10:20, :-1] 478 | tssTable_new.head() 479 | ``` 480 | 481 | 482 | 483 | 484 |
485 | 486 | 487 | 488 | 489 | 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | 499 | 500 | 501 | 502 | 503 | 504 | 505 | 506 | 507 | 508 | 509 | 510 | 511 | 512 | 513 | 514 | 515 | 516 | 517 | 518 | 519 | 520 | 521 | 522 | 523 | 524 | 525 | 526 | 527 | 528 | 529 | 530 | 531 | 532 | 533 | 534 | 535 | 536 | 537 |
positionstrandchromosome
genetranscripts
AADACL2all151451714+chr3
AADATENST00000337664.4171011117-chr4
ENST00000337664.4,ENST00000509167.1,ENST00000353187.2171011284-chr4
ENST00000509167.1,ENST00000515480.1,ENST00000353187.2171011424-chr4
AAED1all99417562-chr9
538 |
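`loadGencodeData` (called in the next cell) reads the GENCODE GTF, whose ninth column packs attributes as `key "value";` pairs, with repeated `tag` entries collected into a list. A sketch of that attribute parsing (the attribute string below is just an example):

```python
def parse_gtf_attributes(attr_field):
    # ninth GTF column: 'gene_id "X"; gene_type "Y"; tag "basic"; ...'
    attrs = {'tags': []}
    for attr in attr_field.strip('; ').split('; '):
        key, _, value = attr.partition(' ')
        value = value.strip('"')
        if key == 'tag':                # 'tag' may repeat; collect all of them
            attrs['tags'].append(value)
        else:
            attrs[key] = value
    return attrs

field = 'gene_id "ENSG00000186092.4"; gene_name "OR4F5"; tag "basic"; tag "CCDS"'
print(parse_gtf_attributes(field)['gene_name'])  # OR4F5
print(parse_gtf_attributes(field)['tags'])       # ['basic', 'CCDS']
```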
539 | 540 | 541 | 542 | 543 | ```python 544 | #if desired, use the ensembl annotation and the HGNC database to supply gene aliases to assist P1P2 matching in the next step 545 | gencodeData = loadGencodeData('large_data_files/gencode.v19.annotation.gtf') 546 | geneToAliases = generateAliasDict('large_data_files/20150424_HGNC_symbols.txt',gencodeData) 547 | ``` 548 | 549 | Loading annotation file...Done 550 | 551 | 552 | 553 | ```python 554 | #Now create a TSS annotation by searching for P1 and P2 peaks near annotated TSSs 555 | #same parameters as for our lncRNA libraries 556 | p1p2Table_new = generateTssTable_P1P2strategy(tssTable_new, 'large_data_files/TSS_human.sorted.bed.gz', 557 | matchedp1p2Window = 30000, #region around supplied TSS annotation to search for a FANTOM P1 or P2 peak that matches the gene name (or alias) 558 | anyp1p2Window = 500, #region around supplied TSS annotation to search for the nearest P1 or P2 peak 559 | anyPeakWindow = 200, #region around supplied TSS annotation to search for any CAGE peak 560 | minDistanceForTwoTSS = 1000, #If a P1 and P2 peak are found, maximum distance at which to combine into a single annotation (with primary/secondary TSS positions) 561 | aliasDict = geneToAliases[0]) 562 | #the function will report some collisions of IDs due to use of aliases and redundancy in genome, but will resolve these itself 563 | ``` 564 | 565 | 566 | ```python 567 | p1p2Table_new.head() 568 | ``` 569 | 570 | 571 | 572 | 573 |
| gene | transcript | chromosome | strand | TSS source | primary TSS | secondary TSS |
|------|------------|------------|--------|------------|-------------|---------------|
| AADACL2 | P1P2 | chr3 | + | CAGE, matched peaks | (151451707, 151451722) | (151451707, 151451722) |
| AADAT | P1P2 | chr4 | - | CAGE, matched peaks | (171011323, 171011408) | (171011084, 171011147) |
| AAED1 | P1P2 | chr9 | - | CAGE, matched peaks | (99417562, 99417609) | (99417615, 99417622) |
| AAGAB | P1P2 | chr15 | - | CAGE, matched peaks | (67546963, 67547024) | (67546963, 67547024) |
| AAK1 | P1P2 | chr2 | - | CAGE, matched peaks | (69870747, 69870812) | (69870854, 69870878) |
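The windowed matching performed by `generateTssTable_P1P2strategy` can be illustrated with a minimal standalone sketch. This is a hypothetical helper, not the function's actual implementation; it assumes CAGE peaks are provided as `(position, name)` tuples with FANTOM-style names such as `p1@AADAT`:

```python
# Minimal sketch of the P1/P2 matching idea (hypothetical helper, assumptions
# noted above): prefer a gene-name-matched P1/P2 peak within a wide window,
# then fall back to the nearest P1/P2 peak within a narrow window.

def match_p1p2_peak(tss_pos, gene, peaks, matched_window=30000, any_window=500):
    """peaks: list of (position, peak_name) tuples, e.g. (171011323, 'p1@AADAT')."""
    # first pass: P1/P2 peaks whose name matches the gene, within the wide window
    matched = [(pos, name) for pos, name in peaks
               if abs(pos - tss_pos) <= matched_window
               and name.split('@')[0] in ('p1', 'p2')
               and name.split('@')[-1] == gene]
    if matched:
        return min(matched, key=lambda p: abs(p[0] - tss_pos)), 'matched peaks'
    # second pass: any P1/P2 peak within the narrow fallback window
    nearby = [(pos, name) for pos, name in peaks
              if abs(pos - tss_pos) <= any_window
              and name.split('@')[0] in ('p1', 'p2')]
    if nearby:
        return min(nearby, key=lambda p: abs(p[0] - tss_pos)), 'nearby P1/P2'
    return None, 'no peak'
```

The two windows here mirror the `matchedp1p2Window` and `anyp1p2Window` parameters above; the real function additionally handles aliases, any-CAGE-peak fallback, and merging of nearby P1/P2 peaks into primary/secondary TSS annotations.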
644 | 645 | 646 | 647 | 648 | ```python 649 | p1p2Table_new.groupby('TSS source').agg(len).iloc[:,[2]] 650 | ``` 651 | 652 | 653 | 654 | 655 |
| TSS source | primary TSS |
|------------|-------------|
| CAGE, matched peaks | 8 |
675 | 676 | 677 | 678 | 679 | ```python 680 | len(p1p2Table_new) 681 | ``` 682 | 683 | 684 | 685 | 686 | 8 687 | 688 | 689 | 690 | 691 | ```python 692 | #save tables 693 | tssTable_new.to_csv(TSS_TABLE_PATH, sep='\t') 694 | p1p2Table_new.to_csv(P1P2_TABLE_PATH, sep='\t') 695 | ``` 696 | 697 | ## Find all sgRNAs in genomic regions of interest 698 | 699 | 700 | ```python 701 | libraryTable_new, sgInfoTable_new = findAllGuides(p1p2Table_new, genomeDict, 702 | (-25,500)) #region around P1P2 TSSs to search for new sgRNAs; recommend -550,-25 for CRISPRa 703 | ``` 704 | 705 | 706 | ```python 707 | len(libraryTable_new) 708 | ``` 709 | 710 | 711 | 712 | 713 | 1125 714 | 715 | 716 | 717 | 718 | ```python 719 | libraryTable_new.head() 720 | ``` 721 | 722 | 723 | 724 | 725 |
| sgId | gene | transcripts | sequence | genomic sequence |
|------|------|-------------|----------|------------------|
| AADACL2_+_151451720.23-P1P2 | AADACL2 | P1P2 | GTAGACTTGGGAACTCTCTC | CCTGAGAGAGTTCCCAAGTCTAC |
| AADACL2_+_151451732.23-P1P2 | AADACL2 | P1P2 | GGTAGAGCAATTGTAGACTT | CCCAAGTCTACAATTGCTCTACT |
| AADACL2_+_151451733.23-P1P2 | AADACL2 | P1P2 | GAGTAGAGCAATTGTAGACT | CCAAGTCTACAATTGCTCTACTA |
| AADACL2_-_151451809.23-P1P2 | AADACL2 | P1P2 | GCTCAGTACTGTGAAGAAGC | TCTCAGTACTGTGAAGAAGCTGG |
| AADACL2_-_151451816.23-P1P2 | AADACL2 | P1P2 | GCTGTGAAGAAGCTGGAAAA | ACTGTGAAGAAGCTGGAAAAAGG |
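In the table above, each `genomic sequence` is the 23 nt site (20 nt protospacer plus NGG PAM) written on the reference strand, so depending on the sgRNA's strand the protospacer reads either directly or as the reverse complement; each `sequence` is the protospacer with its first base set to G for U6 expression. A quick consistency check of rows like these (plain-Python sketch, not part of the pipeline):

```python
# Sanity-check sketch (hypothetical helper): confirm an sgRNA sequence matches
# its 23-nt genomic site in either orientation, allowing the G forced at the
# sgRNA's first position and requiring an NGG PAM.

COMP = str.maketrans('ACGT', 'TGCA')

def revcomp(seq):
    return seq.upper().translate(COMP)[::-1]

def sgrna_matches_site(sgrna, genomic):
    # try both orientations of the reference-strand site
    for target in (revcomp(genomic), genomic.upper()):
        protospacer, pam = target[:20], target[20:]
        # positions 2-20 must match exactly; position 1 may be the forced G
        if sgrna[1:] == protospacer[1:] and pam[1:] == 'GG':
            return True
    return False
```

For example, the first row above passes because `revcomp('CCTGAGAGAGTTCCCAAGTCTAC')` is `GTAGACTTGGGAACTCTCTCAGG`: the 20-mer protospacer followed by an AGG PAM.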
782 | 783 | 784 | 785 | ## Predicting sgRNA activity 786 | 787 | 788 | ```python 789 | #calculate parameters for new sgRNAs 790 | paramTable_new = generateTypicalParamTable(libraryTable_new, sgInfoTable_new, tssTable_new, p1p2Table_new, genomeDict, bwhandleDict) 791 | ``` 792 | 793 | .....Done! 794 | 795 | 796 | ```python 797 | paramTable_new.head() 798 | ``` 799 | 800 | 801 | 802 | 803 |
(Columns, left to right: length; distance to primary/secondary TSS; homopolymer run lengths A/G/C/T; base fractions; ...; RNA folding-pairing, no scaffold, at protospacer positions -12 to -3.)

| sgId | length | primary TSS-Up | primary TSS-Down | secondary TSS-Up | secondary TSS-Down | A | G | C | T | A (frac) | ... | -12 | -11 | -10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 |
|------|--------|----------------|------------------|------------------|--------------------|---|---|---|---|----------|-----|-----|-----|-----|----|----|----|----|----|----|----|
| AADACL2_+_151451720.23-P1P2 | 20 | 13 | -2 | 13 | -2 | 2 | 3 | 1 | 2 | 0.20 | ... | True | True | False | False | False | False | True | True | True | True |
| AADACL2_+_151451732.23-P1P2 | 20 | 25 | 10 | 25 | 10 | 2 | 2 | 1 | 2 | 0.30 | ... | False | False | False | False | False | False | False | False | False | False |
| AADACL2_+_151451733.23-P1P2 | 20 | 26 | 11 | 26 | 11 | 2 | 1 | 1 | 2 | 0.35 | ... | False | False | False | False | False | False | False | False | False | False |
| AADACL2_-_151451809.23-P1P2 | 20 | 102 | 87 | 102 | 87 | 2 | 1 | 1 | 1 | 0.30 | ... | False | True | True | True | False | False | False | False | False | False |
| AADACL2_-_151451816.23-P1P2 | 20 | 109 | 94 | 109 | 94 | 4 | 2 | 1 | 1 | 0.40 | ... | True | True | True | False | False | False | False | False | False | False |

5 rows × 808 columns
989 | 990 | 991 | 992 | 993 | ```python 994 | #if starting from a separate session from where you ran the sgRNA learning steps of Part 1, reload the following 995 | import cPickle 996 | with open(PICKLE_FILE) as infile: 997 | fitTable, estimators, scaler, reg = cPickle.load(infile) 998 | 999 | transformedParams_train = pd.read_csv(TRANSFORMED_PARAM_HEADER,sep='\t') 1000 | ``` 1001 | 1002 | 1003 | ```python 1004 | #transform and predict scores according to sgRNA prediction model 1005 | transformedParams_new = transformParams(paramTable_new, fitTable, estimators) 1006 | 1007 | #reconcile any differences in column headers generated by automated binning 1008 | colTups = [] 1009 | for (l1, l2), col in transformedParams_new.iteritems(): 1010 | colTups.append((l1,str(l2))) 1011 | transformedParams_new.columns = pd.MultiIndex.from_tuples(colTups) 1012 | 1013 | predictedScores_new = pd.Series(reg.predict(scaler.transform(transformedParams_new.loc[:, transformedParams_train.columns].fillna(0).values)), index=transformedParams_new.index) 1014 | ``` 1015 | 1016 | 1017 | ```python 1018 | predictedScores_new.head() 1019 | ``` 1020 | 1021 | 1022 | 1023 | 1024 | sgId 1025 | AADACL2_+_151451720.23-P1P2 0.641245 1026 | AADACL2_+_151451732.23-P1P2 0.693926 1027 | AADACL2_+_151451733.23-P1P2 0.655759 1028 | AADACL2_-_151451809.23-P1P2 0.500835 1029 | AADACL2_-_151451816.23-P1P2 0.434376 1030 | dtype: float64 1031 | 1032 | 1033 | 1034 | 1035 | ```python 1036 | libraryTable_new.to_csv(LIBRARY_TABLE_PATH,sep='\t') 1037 | sgInfoTable_new.to_csv(sgRNA_INFO_PATH,sep='\t') 1038 | predictedScores_new.to_csv(PREDICTED_SCORES_PATH, sep='\t') 1039 | ``` 1040 | 1041 | # 3. 
Construct sgRNA libraries 1042 | ## Score sgRNAs for off-target potential 1043 | 1044 | ### There are many ways to score sgRNAs for off-target potential; below is one method that is simple and flexible, but it ignores gapped alignments and alternate PAMs, and it relies on bowtie, which may not be maximally sensitive in all cases 1045 | 1046 | 1047 | ```python 1048 | !mkdir temp_bowtie_files 1049 | ``` 1050 | 1051 | 1052 | ```python 1053 | #output all sequences to a temporary FASTQ file for running bowtie alignment 1054 | fqFile = 'temp_bowtie_files/bowtie_input.fq' 1055 | 1056 | def outputTempBowtieFastq(libraryTable, outputFileName): 1057 | phredString = 'I4!=======44444+++++++' #weighting for how impactful mismatches are along sgRNA sequence 1058 | with open(outputFileName,'w') as outfile: 1059 | for name, row in libraryTable.iterrows(): 1060 | outfile.write('@' + name + '\n') 1061 | outfile.write('CCN' + str(Seq.Seq(row['sequence'][1:]).reverse_complement()) + '\n') 1062 | outfile.write('+\n') 1063 | outfile.write(phredString + '\n') 1064 | 1065 | outputTempBowtieFastq(libraryTable_new, fqFile) 1066 | ``` 1067 | 1068 | 1069 | ```python 1070 | import subprocess 1071 | 1072 | #specifying a list of parameters to run bowtie with 1073 | #each tuple contains 1074 | # *the mismatch threshold below which a site is considered a potential off-target (higher is more stringent) 1075 | # *the number of sites allowed (1 is minimum since each sgRNA should have one true site in genome) 1076 | # *the genome index against which to align the sgRNA sequences; these can be custom built to only consider sites near TSSs 1077 | # *a name for the bowtie run to create appropriately named output files 1078 | alignmentList = [(39,1,'large_data_files/hg19.ensemblTSSflank500b','39_nearTSS'), 1079 | (31,1,'large_data_files/hg19.ensemblTSSflank500b','31_nearTSS'), 1080 | (21,1,'large_data_files/hg19_maskChrMandPAR','21_genome'), 1081 | (31,2,'large_data_files/hg19.ensemblTSSflank500b','31_2_nearTSS'), 1082 |
(31,3,'large_data_files/hg19.ensemblTSSflank500b','31_3_nearTSS')] 1083 | 1084 | alignmentColumns = [] 1085 | for btThreshold, mflag, bowtieIndex, runname in alignmentList: 1086 | 1087 | alignedFile = 'temp_bowtie_files/' + runname + '_aligned.txt' 1088 | unalignedFile = 'temp_bowtie_files/' + runname + '_unaligned.fq' 1089 | maxFile = 'temp_bowtie_files/' + runname + '_max.fq' 1090 | 1091 | bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile 1092 | print bowtieString 1093 | print subprocess.call(bowtieString, shell=True) #0 means finished without errors 1094 | 1095 | #parse through the file of sgRNAs that exceeded "m", the maximum allowable alignments, and mark "True" any that are found 1096 | try: 1097 | with open(maxFile) as infile: 1098 | sgsAligning = set() 1099 | for i, line in enumerate(infile): 1100 | if i%4 == 0: #id line 1101 | sgsAligning.add(line.strip()[1:]) 1102 | except IOError: #no sgRNAs exceeded m, so no maxFile created 1103 | sgsAligning = set() 1104 | 1105 | alignmentColumns.append(libraryTable_new.apply(lambda row: row.name in sgsAligning, axis=1)) 1106 | 1107 | #collate results into a table, and flip the boolean values to yield the sgRNAs that passed filter as True 1108 | alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3]).ne(True) 1109 | ``` 1110 | 1111 | bowtie -n 3 -l 15 -e 39 -m 1 --nomaqround -a --tryhard -p 16 --chunkmbs 256 large_data_files/hg19.ensemblTSSflank500b --suppress 5,6,7 --un temp_bowtie_files/39_nearTSS_unaligned.fq --max temp_bowtie_files/39_nearTSS_max.fq -q temp_bowtie_files/bowtie_input.fq temp_bowtie_files/39_nearTSS_aligned.txt 1112 | 0 1113 | bowtie -n 3 -l 15 -e 31 -m 1 --nomaqround -a --tryhard -p 16 --chunkmbs 256 large_data_files/hg19.ensemblTSSflank500b --suppress 5,6,7 --un 
temp_bowtie_files/31_nearTSS_unaligned.fq --max temp_bowtie_files/31_nearTSS_max.fq -q temp_bowtie_files/bowtie_input.fq temp_bowtie_files/31_nearTSS_aligned.txt 1114 | 0 1115 | bowtie -n 3 -l 15 -e 21 -m 1 --nomaqround -a --tryhard -p 16 --chunkmbs 256 large_data_files/hg19_maskChrMandPAR --suppress 5,6,7 --un temp_bowtie_files/21_genome_unaligned.fq --max temp_bowtie_files/21_genome_max.fq -q temp_bowtie_files/bowtie_input.fq temp_bowtie_files/21_genome_aligned.txt 1116 | 0 1117 | bowtie -n 3 -l 15 -e 31 -m 2 --nomaqround -a --tryhard -p 16 --chunkmbs 256 large_data_files/hg19.ensemblTSSflank500b --suppress 5,6,7 --un temp_bowtie_files/31_2_nearTSS_unaligned.fq --max temp_bowtie_files/31_2_nearTSS_max.fq -q temp_bowtie_files/bowtie_input.fq temp_bowtie_files/31_2_nearTSS_aligned.txt 1118 | 0 1119 | bowtie -n 3 -l 15 -e 31 -m 3 --nomaqround -a --tryhard -p 16 --chunkmbs 256 large_data_files/hg19.ensemblTSSflank500b --suppress 5,6,7 --un temp_bowtie_files/31_3_nearTSS_unaligned.fq --max temp_bowtie_files/31_3_nearTSS_max.fq -q temp_bowtie_files/bowtie_input.fq temp_bowtie_files/31_3_nearTSS_aligned.txt 1120 | 0 1121 | 1122 | 1123 | 1124 | ```python 1125 | alignmentTable.head() #True = passed threshold 1126 | ``` 1127 | 1128 | 1129 | 1130 | 1131 |
| sgId | 39_nearTSS | 31_nearTSS | 21_genome | 31_2_nearTSS | 31_3_nearTSS |
|------|------------|------------|-----------|--------------|--------------|
| AADACL2_+_151451720.23-P1P2 | True | True | False | True | True |
| AADACL2_+_151451732.23-P1P2 | True | True | True | True | True |
| AADACL2_+_151451733.23-P1P2 | True | True | True | True | True |
| AADACL2_-_151451809.23-P1P2 | False | True | False | True | True |
| AADACL2_-_151451816.23-P1P2 | True | True | False | True | True |
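The quality-string trick in `outputTempBowtieFastq` is what lets the `-e` thresholds above act as weighted mismatch scores: in `-n` mode, bowtie caps the sum of Phred quality values at all mismatched read positions, so a position-specific quality string translates into position-specific mismatch penalties (interpretation assumed from the bowtie 1 manual). Decoding the string used above under Phred+33:

```python
# Decode the phredString from outputTempBowtieFastq (Phred+33).
# Each value is the penalty charged if that read position mismatches;
# bowtie's -e threshold caps the total penalty over all mismatches.

phred_string = 'I4!=======44444+++++++'  # written 5'->3' over the CCN + reverse-complement read

weights = [ord(c) - 33 for c in phred_string]
print(weights)
```

The read is written as `CCN` plus the reverse complement of the sgRNA, so the first two positions cover the PAM (penalties 40 and 19), the `N` is free (0, matching any NGG site), and protospacer penalties fall from 28 at PAM-proximal positions to 10 at the PAM-distal end. A larger `-e` budget lets more-mismatched sites still count as alignments, which is why the higher thresholds (39, 31) give more stringent off-target filters than 21.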
1195 | 1196 | 1197 | 1198 | ## Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering 1199 | 1200 | 1201 | ```python 1202 | #combine all generated data into one master table 1203 | predictedScores_new.name = 'predicted score' 1204 | v2Table = pd.concat((libraryTable_new, predictedScores_new, alignmentTable, sgInfoTable_new), axis=1, keys=['library table v2', 'predicted score', 'off-target filters', 'sgRNA info']) 1205 | ``` 1206 | 1207 | 1208 | ```python 1209 | import re 1210 | #for our pCRISPRi/a-v2 vector, we append flanking sequences to each sgRNA sequence for cloning and require the oligo to contain 1211 | #exactly 1 BstXI and BlpI site each for cloning, and exactly 0 SbfI sites for sequencing sample preparation 1212 | restrictionSites = {re.compile('CCA......TGG'):1, 1213 | re.compile('GCT.AGC'):1, 1214 | re.compile('CCTGCAGG'):0} 1215 | 1216 | def matchREsites(sequence, REdict): 1217 | seq = sequence.upper() 1218 | for resite, numMatchesExpected in REdict.iteritems(): 1219 | if len(resite.findall(seq)) != numMatchesExpected: 1220 | return False 1221 | 1222 | return True 1223 | 1224 | def checkOverlaps(leftPosition, acceptedLeftPositions, nonoverlapMin): 1225 | for pos in acceptedLeftPositions: 1226 | if abs(pos - leftPosition) < nonoverlapMin: 1227 | return False 1228 | return True 1229 | ``` 1230 | 1231 | 1232 | ```python 1233 | #flanking sequences 1234 | upstreamConstant = 'CCACCTTGTTG' 1235 | downstreamConstant = 'GTTTAAGAGCTAAGCTG' 1236 | 1237 | #minimum distance between the left positions of two sgRNAs targeting the same TSS 1238 | nonoverlapMin = 3 1239 | 1240 | #number of sgRNAs to pick per gene/TSS 1241 | sgRNAsToPick = 10 1242 | 1243 | #list of off-target filter (or combinations of filters) levels, matching the names in the alignment table above 1244 | offTargetLevels = [['31_nearTSS', '21_genome'], 1245 | ['31_nearTSS'], 1246 | ['21_genome'], 1247 | ['31_2_nearTSS'], 1248 | ['31_3_nearTSS']] 1249 | 1250 | #for each gene/TSS,
go through each sgRNA in descending order of predicted score 1251 | #if an sgRNA passes the restriction site, overlap, and off-target filters, accept it into the library 1252 | #if the number of sgRNAs accepted is less than sgRNAsToPick, reduce off-target stringency by one and continue 1253 | v2Groups = v2Table.groupby([('library table v2','gene'),('library table v2','transcripts')]) 1254 | newSgIds = [] 1255 | unfinishedTss = [] 1256 | for (gene, transcript), group in v2Groups: 1257 | geneSgIds = [] 1258 | geneLeftPositions = [] 1259 | empiricalSgIds = dict() 1260 | 1261 | stringency = 0 1262 | 1263 | while len(geneSgIds) < sgRNAsToPick and stringency < len(offTargetLevels): 1264 | for sgId_v2, row in group.sort_values(('predicted score','predicted score'), ascending=False).iterrows(): 1265 | oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant 1266 | leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0) 1267 | if len(geneSgIds) < sgRNAsToPick and row['off-target filters'].loc[offTargetLevels[stringency]].all() \ 1268 | and matchREsites(oligoSeq, restrictionSites) \ 1269 | and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin): 1270 | geneSgIds.append((sgId_v2, 1271 | gene,transcript, 1272 | row[('library table v2','sequence')], oligoSeq, 1273 | row[('predicted score','predicted score')], np.nan, 1274 | stringency)) 1275 | geneLeftPositions.append(leftPos) 1276 | 1277 | stringency += 1 1278 | 1279 | if len(geneSgIds) < sgRNAsToPick: 1280 | unfinishedTss.append((gene, transcript)) #if the number of accepted sgRNAs is still less than sgRNAsToPick, discard gene 1281 | else: 1282 | newSgIds.extend(geneSgIds) 1283 | 1284 | libraryTable_complete = pd.DataFrame(newSgIds, columns = ['sgID', 'gene', 'transcript','protospacer sequence', 'oligo sequence', 1285 | 'predicted score', 'empirical score', 'off-target stringency']).set_index('sgID') 1286 | ``` 1287 | 1288 | 1289 | ```python 1290 | 
print len(libraryTable_complete) 1291 | ``` 1292 | 1293 | 80 1294 | 1295 | 1296 | 1297 | ```python 1298 | #number of sgRNAs accepted at each stringency level 1299 | libraryTable_complete.groupby('off-target stringency').agg(len).iloc[:,0] 1300 | ``` 1301 | 1302 | 1303 | 1304 | 1305 | off-target stringency 1306 | 0 80 1307 | Name: gene, dtype: int64 1308 | 1309 | 1310 | 1311 | 1312 | ```python 1313 | #number of TSSs with fewer than required number of sgRNAs (and thus not included in the library) 1314 | print len(unfinishedTss) 1315 | ``` 1316 | 1317 | 0 1318 | 1319 | 1320 | 1321 | ```python 1322 | libraryTable_complete.head() 1323 | ``` 1324 | 1325 | 1326 | 1327 | 1328 |
| sgID | gene | transcript | protospacer sequence | oligo sequence | predicted score | empirical score | off-target stringency |
|------|------|------------|----------------------|----------------|-----------------|-----------------|-----------------------|
| AADACL2_+_151451732.23-P1P2 | AADACL2 | P1P2 | GGTAGAGCAATTGTAGACTT | CCACCTTGTTGGGTAGAGCAATTGTAGACTTGTTTAAGAGCTAAGCTG | 0.693926 | NaN | 0 |
| AADACL2_-_151452019.23-P1P2 | AADACL2 | P1P2 | GATGACTTATTGACTAAAAA | CCACCTTGTTGGATGACTTATTGACTAAAAAGTTTAAGAGCTAAGCTG | 0.451392 | NaN | 0 |
| AADACL2_+_151452121.23-P1P2 | AADACL2 | P1P2 | GACTGTTACTCACAGATATA | CCACCTTGTTGGACTGTTACTCACAGATATAGTTTAAGAGCTAAGCTG | 0.426695 | NaN | 0 |
| AADACL2_-_151451828.23-P1P2 | AADACL2 | P1P2 | GTGGAAAAAGGGATATTATG | CCACCTTGTTGGTGGAAAAAGGGATATTATGGTTTAAGAGCTAAGCTG | 0.404655 | NaN | 0 |
| AADACL2_-_151451931.23-P1P2 | AADACL2 | P1P2 | GAGCTGGAAAATAATGGCCT | CCACCTTGTTGGAGCTGGAAAATAATGGCCTGTTTAAGAGCTAAGCTG | 0.404269 | NaN | 0 |
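The stringency-relaxation loop above can be condensed into a standalone sketch (hypothetical helper; the restriction-site and non-overlap checks are omitted here for brevity):

```python
# Greedy selection with falling off-target stringency: walk candidates in
# descending predicted-score order, accept those passing the current filter
# combination, and relax stringency until enough sgRNAs are accepted.

def pick_sgrnas(candidates, filter_levels, n_wanted):
    """candidates: list of (sgId, score, {filter_name: passed_bool}) tuples."""
    accepted = []
    taken = set()
    for stringency, level in enumerate(filter_levels):
        for sg_id, score, filters in sorted(candidates, key=lambda c: -c[1]):
            if len(accepted) >= n_wanted:
                break
            if sg_id not in taken and all(filters[f] for f in level):
                accepted.append((sg_id, score, stringency))
                taken.add(sg_id)
        if len(accepted) >= n_wanted:
            break
    return accepted
```

In the notebook's own loop there is no explicit `taken` set; the non-overlap check doubles as the duplicate guard, since an sgRNA accepted at a higher stringency sits 0 bp from itself and therefore cannot be re-accepted on a later, less stringent pass.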
1406 | 1407 | 1408 | 1409 | # 5. Design negative controls matching the base composition of the library 1410 | 1411 | 1412 | ```python 1413 | #calculate the base frequency at each position of the sgRNA, then generate random sequences weighted by this frequency 1414 | def getBaseFrequencies(libraryTable, baseConversion = {'G':0, 'C':1, 'T':2, 'A':3}): 1415 | baseArray = np.zeros((len(libraryTable),20)) 1416 | 1417 | for i, (index, seq) in enumerate(libraryTable['protospacer sequence'].iteritems()): 1418 | for j, char in enumerate(seq.upper()): 1419 | baseArray[i,j] = baseConversion[char] 1420 | 1421 | baseTable = pd.DataFrame(baseArray, index = libraryTable.index) 1422 | 1423 | baseFrequencies = baseTable.apply(lambda col: col.groupby(col).agg(len)).fillna(0) / len(baseTable) 1424 | baseFrequencies.index = ['G','C','T','A'] 1425 | 1426 | baseCumulativeFrequencies = baseFrequencies.copy() 1427 | baseCumulativeFrequencies.loc['C'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] 1428 | baseCumulativeFrequencies.loc['T'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T'] 1429 | baseCumulativeFrequencies.loc['A'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T'] + baseFrequencies.loc['A'] 1430 | 1431 | return baseFrequencies, baseCumulativeFrequencies 1432 | 1433 | def generateRandomSequence(baseCumulativeFrequencies): 1434 | randArray = np.random.random(baseCumulativeFrequencies.shape[1]) 1435 | 1436 | seq = [] 1437 | for i, col in baseCumulativeFrequencies.iteritems(): 1438 | for base, freq in col.iteritems(): 1439 | if randArray[i] < freq: 1440 | seq.append(base) 1441 | break 1442 | 1443 | return ''.join(seq) 1444 | ``` 1445 | 1446 | 1447 | ```python 1448 | baseCumulativeFrequencies = getBaseFrequencies(libraryTable_complete)[1] 1449 | negList = [] 1450 | numberToGenerate = 1000 #can generate many more; some will be filtered out by off-targets, and you can always select an arbitrary subset for inclusion
into the library 1451 | for i in range(numberToGenerate): 1452 | negList.append(generateRandomSequence(baseCumulativeFrequencies)) 1453 | negTable = pd.DataFrame(negList, index=['non-targeting_' + str(i) for i in range(numberToGenerate)], columns = ['sequence']) 1454 | 1455 | fqFile = 'temp_bowtie_files/bowtie_input_negs.fq' 1456 | outputTempBowtieFastq(negTable, fqFile) 1457 | ``` 1458 | 1459 | 1460 | ```python 1461 | #similar to targeting sgRNA off-target scoring, but looking for sgRNAs with 0 alignments 1462 | alignmentList = [(31,1,'~/indices/hg19.ensemblTSSflank500b','31_nearTSS_negs'), 1463 | (21,1,'~/indices/hg19_maskChrMandPAR','21_genome_negs')] 1464 | 1465 | alignmentColumns = [] 1466 | for btThreshold, mflag, bowtieIndex, runname in alignmentList: 1467 | 1468 | alignedFile = 'temp_bowtie_files/' + runname + '_aligned.txt' 1469 | unalignedFile = 'temp_bowtie_files/' + runname + '_unaligned.fq' 1470 | maxFile = 'temp_bowtie_files/' + runname + '_max.fq' 1471 | 1472 | bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile 1473 | print bowtieString 1474 | print subprocess.call(bowtieString, shell=True) 1475 | 1476 | #read unaligned file for negs, and then don't flip boolean of alignmentTable 1477 | with open(unalignedFile) as infile: 1478 | sgsAligning = set() 1479 | for i, line in enumerate(infile): 1480 | if i%4 == 0: #id line 1481 | sgsAligning.add(line.strip()[1:]) 1482 | 1483 | alignmentColumns.append(negTable.apply(lambda row: row.name in sgsAligning, axis=1)) 1484 | 1485 | alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3]) 1486 | alignmentTable.head() 1487 | ``` 1488 | 1489 | bowtie -n 3 -l 15 -e 31 -m 1 --nomaqround -a --tryhard -p 16 --chunkmbs 256 ~/indices/hg19.ensemblTSSflank500b --suppress 5,6,7 --un 
temp_bowtie_files/31_nearTSS_negs_unaligned.fq --max temp_bowtie_files/31_nearTSS_negs_max.fq -q temp_bowtie_files/bowtie_input_negs.fq temp_bowtie_files/31_nearTSS_negs_aligned.txt 1490 | 0 1491 | bowtie -n 3 -l 15 -e 21 -m 1 --nomaqround -a --tryhard -p 16 --chunkmbs 256 ~/indices/hg19_maskChrMandPAR --suppress 5,6,7 --un temp_bowtie_files/21_genome_negs_unaligned.fq --max temp_bowtie_files/21_genome_negs_max.fq -q temp_bowtie_files/bowtie_input_negs.fq temp_bowtie_files/21_genome_negs_aligned.txt 1492 | 0 1493 | 1494 | 1495 | 1496 | 1497 | 1498 |
|  | 31_nearTSS_negs | 21_genome_negs |
|--|-----------------|----------------|
| non-targeting_0 | True | True |
| non-targeting_1 | True | True |
| non-targeting_2 | True | True |
| non-targeting_3 | True | False |
| non-targeting_4 | True | True |
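The base-composition matching above (written for Python 2 pandas) boils down to two steps: estimate per-position base frequencies from the accepted protospacers, then sample each position independently. A Python 3 sketch of the same idea, with hypothetical helper names:

```python
# Python 3 sketch of base-composition-matched random sequence generation
# (same idea as getBaseFrequencies/generateRandomSequence above).

import random

def position_frequencies(seqs):
    """Return one {base: frequency} dict per sequence position."""
    length = len(seqs[0])
    freqs = []
    for i in range(length):
        counts = {'G': 0, 'C': 0, 'T': 0, 'A': 0}
        for seq in seqs:
            counts[seq[i].upper()] += 1
        freqs.append({b: n / len(seqs) for b, n in counts.items()})
    return freqs

def sample_sequence(freqs, rng=random):
    # draw each position independently, weighted by the observed frequencies
    bases = []
    for col in freqs:
        names, weights = zip(*col.items())
        bases.append(rng.choices(names, weights=weights)[0])
    return ''.join(bases)
```

Note that sampling positions independently matches per-position composition but not dinucleotide or GC-content correlations; for the notebook's purpose (negative controls with library-like base usage) that is sufficient.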
1536 | 1537 | 1538 | 1539 | 1540 | ```python 1541 | acceptedNegList = [] 1542 | negCount = 0 1543 | for i, (name, row) in enumerate(pd.concat((negTable,alignmentTable),axis=1, keys=['seq','alignment']).iterrows()): 1544 | oligo = upstreamConstant + row['seq','sequence'] + downstreamConstant 1545 | if row['alignment'].all() and matchREsites(oligo, restrictionSites): 1546 | acceptedNegList.append(('non-targeting_%05d' % negCount, 'negative_control', 'na', row['seq','sequence'], oligo, 0)) 1547 | negCount += 1 1548 | 1549 | acceptedNegs = pd.DataFrame(acceptedNegList, columns = ['sgId', 'gene', 'transcript', 'protospacer sequence', 'oligo sequence', 'off-target stringency']).set_index('sgId') 1550 | ``` 1551 | 1552 | 1553 | ```python 1554 | acceptedNegs.head() 1555 | ``` 1556 | 1557 | 1558 | 1559 | 1560 |
| sgId | gene | transcript | protospacer sequence | oligo sequence | off-target stringency |
|------|------|------------|----------------------|----------------|-----------------------|
| non-targeting_00000 | negative_control | na | GGTTTCGCGCGCTTACAGAT | CCACCTTGTTGGGTTTCGCGCGCTTACAGATGTTTAAGAGCTAAGCTG | 0 |
| non-targeting_00001 | negative_control | na | GGTGGTCGAAGATAGCGAGC | CCACCTTGTTGGGTGGTCGAAGATAGCGAGCGTTTAAGAGCTAAGCTG | 0 |
| non-targeting_00002 | negative_control | na | GCTCTTGACAAATTCAAGCT | CCACCTTGTTGGCTCTTGACAAATTCAAGCTGTTTAAGAGCTAAGCTG | 0 |
| non-targeting_00003 | negative_control | na | GGCCGGGAGAGCGGGAACTC | CCACCTTGTTGGGCCGGGAGAGCGGGAACTCGTTTAAGAGCTAAGCTG | 0 |
| non-targeting_00004 | negative_control | na | GTCGCAAGCCGGGGTAGGGT | CCACCTTGTTGGTCGCAAGCCGGGGTAGGGTGTTTAAGAGCTAAGCTG | 0 |
1624 | 1625 | 1626 | 1627 | 1628 | ```python 1629 | libraryTable_complete.to_csv(LIBRARY_WITHOUT_NEGATIVES_PATH, sep='\t') 1630 | acceptedNegs.to_csv(NEGATIVE_CONTROLS_PATH,sep='\t') 1631 | ``` 1632 | 1633 | # 6. Finalizing library design 1634 | 1635 | * divide genes into sublibrary groups (if required) 1636 | * assign negative control sgRNAs to sublibrary groups; ~1-2% of the number of sgRNAs in the library is a good rule-of-thumb 1637 | * append PCR adapter sequences (~18bp) to each end of the oligo sequences to enable amplification of the oligo pool; each sublibrary should have an orthogonal sequence so they can be cloned separately 1638 | 1639 | 1640 | ```python 1641 | 1642 | ``` 1643 | --------------------------------------------------------------------------------
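The adapter-appending step listed above can be sketched as follows. The 18-mers here are placeholders, not the published pCRISPRi/a-v2 sublibrary primer sequences; the point is that each sublibrary gets its own orthogonal pair so pools can be PCR-amplified and cloned independently:

```python
# Sketch of final oligo assembly (placeholder adapters, hypothetical helper).

adapters = {
    'sublib_A': ('ACGTACGTACGTACGTAC', 'TGCATGCATGCATGCATG'),  # placeholder 18-mers
    'sublib_B': ('GATCGATCGATCGATCGA', 'CTAGCTAGCTAGCTAGCT'),
}

def finalize_oligos(oligo_seqs, sublib):
    """Append the sublibrary's forward/reverse PCR adapters to each oligo."""
    fwd, rev = adapters[sublib]
    return [fwd + seq + rev for seq in oligo_seqs]
```

With the 48 bp oligos produced above (11 bp upstream constant + 20 bp protospacer + 17 bp downstream constant), the finalized oligos come to 84 bp with 18 bp adapters on each end.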