├── .gitignore
├── README.md
├── CRISPRiaDesign_example_notebook.md
├── CRISPRiaDesign_example_notebook.ipynb
├── sgRNA_learning.py
└── Library_design_walkthrough.md
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | #files too large for github
7 | large_data_files/
8 |
9 | # C extensions
10 | *.so
11 |
12 | # Distribution / packaging
13 | .Python
14 | env/
15 | build/
16 | develop-eggs/
17 | dist/
18 | downloads/
19 | eggs/
20 | .eggs/
21 | lib/
22 | lib64/
23 | parts/
24 | sdist/
25 | var/
26 | *.egg-info/
27 | .installed.cfg
28 | *.egg
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *.cover
49 | .hypothesis/
50 |
51 | # Translations
52 | *.mo
53 | *.pot
54 |
55 | # Django stuff:
56 | *.log
57 | local_settings.py
58 |
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 |
63 | # Scrapy stuff:
64 | .scrapy
65 |
66 | # Sphinx documentation
67 | docs/_build/
68 |
69 | # PyBuilder
70 | target/
71 |
72 | # IPython Notebook
73 | .ipynb_checkpoints
74 |
75 | # pyenv
76 | .python-version
77 |
78 | # celery beat schedule file
79 | celerybeat-schedule
80 |
81 | # dotenv
82 | .env
83 |
84 | # virtualenv
85 | venv/
86 | ENV/
87 |
88 | # Spyder project settings
89 | .spyderproject
90 |
91 | # Rope project settings
92 | .ropeproject
93 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CRISPRiaDesign
2 |
3 | This site hosts the sgRNA machine learning scripts used to generate the Weissman lab's next-generation CRISPRi and CRISPRa library designs [(Horlbeck et al., eLife 2016)](https://elifesciences.org/content/5/e19760). These are currently implemented as interactive scripts along with Jupyter (IPython) notebooks providing step-by-step instructions for creating new sgRNA libraries. Future plans include adding command line functions to make library design more user-friendly. Note that all sgRNA designs for the CRISPRi/a human/mouse protein-coding gene libraries are included as supplementary tables in the eLife paper, so to clone individual sgRNAs or construct custom sublibraries targeting protein-coding genes, simply refer to those tables. These scripts are primarily useful for designing sgRNAs against novel or non-coding genes, or for organisms other than human and mouse.
4 |
5 | **To apply the exact quantitative models used to generate the CRISPRi-v2 or CRISPRa-v2 libraries**, follow the steps outlined in the Library_design_walkthrough (included as a Jupyter notebook or [web page](Library_design_walkthrough.md)).
6 |
7 | To see full example code for de novo machine learning, prediction of sgRNA activity for desired loci, and construction of new genome-scale CRISPRi/a libraries, see the CRISPRiaDesign_example_notebook (included as a Jupyter notebook or [web page](CRISPRiaDesign_example_notebook.md)).
8 |
9 | ### Dependencies
10 | * Python v2.7
11 | * Jupyter notebook
12 | * Biopython
13 | * Scipy/Numpy/Pandas
14 | * Scikit-learn
15 | * bx-python (v0.5.0, https://github.com/bxlab/bx-python)
16 | * Pysam
17 | * [ScreenProcessing](https://github.com/mhorlbeck/ScreenProcessing)
18 |
19 | External command line applications required:
20 | * ViennaRNA
21 | * Bowtie (not Bowtie2)
22 |
23 | Large genomic data files required:
24 |
25 | Links are to the human genome files used for the hCRISPRi-v2 and hCRISPRa-v2 machine learning (and required for the Library_design_walkthrough), but any organism/assembly may be used for the design of new libraries or de novo machine learning. For convenience, **the files referenced in the Library_design_walkthrough (in the folder "large_data_files") are also available [here](https://ucsf.box.com/s/s4ds471in2ngjer7okavzf5cqf2ebrqj)**.
26 |
27 | * Genome sequence as FASTA ([hg19](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/))
28 | * FANTOM5 TSS annotation as BED ([TSS_human](http://fantom.gsc.riken.jp/5/datafiles/phase1.3/extra/TSS_classifier/))
29 | * Chromatin data as BigWig ([MNase](https://www.encodeproject.org/files/ENCFF000VNN/), [DNase](https://www.encodeproject.org/files/ENCFF000SVL/), [FAIRE-seq](https://www.encodeproject.org/files/ENCFF000TLU/))
30 | * HGNC table of gene aliases (not strictly required for the Library_design_walkthrough but useful in some steps)
31 | * Ensembl annotation as GTF (not strictly required for the Library_design_walkthrough but useful in some steps and in other functions; release 74 used for the published library designs)
32 |
--------------------------------------------------------------------------------
/CRISPRiaDesign_example_notebook.md:
--------------------------------------------------------------------------------
1 |
2 | 1. Learning sgRNA predictors from empirical data
3 | * Load scripts and empirical data
4 | * Generate TSS annotation using FANTOM dataset
5 | * Calculate parameters for empirical sgRNAs
6 | * Fit parameters
7 | 2. Applying machine learning model to predict sgRNA activity
8 | * Find all sgRNAs in genomic regions of interest
9 | * Predicting sgRNA activity
10 | 3. Construct sgRNA libraries
11 | * Score sgRNAs for off-target potential
12 | * Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering
13 | * Design negative controls matching the base composition of the library
14 | * Finalizing library design
15 |
16 | # 1. Learning sgRNA predictors from empirical data
17 | ## Load scripts and empirical data
18 |
19 |
20 | ```python
21 | %run sgRNA_learning.py
22 | ```
23 |
24 |
25 | ```python
26 | genomeDict = loadGenomeAsDict(FASTA_FILE_OF_GENOME)
27 | gencodeData = loadGencodeData(GTF_FILE_FROM_GENCODE)
28 | ```
29 |
30 |
31 | ```python
32 | #load empirical data as tables in the format generated by github.com/mhorlbeck/ScreenProcessing
33 | libraryTable, phenotypeTable, geneTable = loadExperimentData(PATHS_TO_DATA_GENERATED_BY_ScreenProcessing)
34 | ```
35 |
36 |
37 | ```python
38 | #extract genes that scored as hits, normalize phenotypes, and extract information on sgRNAs from the sgIDs
39 | discriminantTable = calculateDiscriminantScores(geneTable)
40 | normedScores, maxDiscriminantTable = getNormalizedsgRNAsOverThresh(libraryTable, phenotypeTable, discriminantTable,
41 |                                                                     DISCRIMINANT_THRESHOLD_eg20,
42 | 3, transcripts=False)
43 |
44 | libraryTable_subset = libraryTable.loc[normedScores.dropna().index]
45 | sgInfoTable = parseAllSgIds(libraryTable_subset)
46 | ```
47 |
48 | ## Generate TSS annotation using FANTOM dataset
49 |
50 |
51 | ```python
52 | #first generates a table of TSS annotations
53 | #legacy function to make an intermediate table for the "P1P2" annotation strategy, will be replaced in future versions
54 | #TSS_TABLE_BASED_ON_ENSEMBL is table without headers with columns:
55 | #gene, transcript, chromosome, TSS coordinate, strand, annotation_source(optional)
56 | tssTable = generateTssTable(geneTable, TSS_TABLE_BASED_ON_ENSEMBL, FANTOM_TSS_ANNOTATION_BED, 200)
57 | ```
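
Before running the cell above, the expected file contents can be inspected with pandas; a minimal sketch (the file name is a hypothetical placeholder, and the column names are added here only for readability, since the file itself has no header row):

```python
#a minimal sketch, assuming a hypothetical tab-delimited file 'ensembl_tss_table.txt'
#with the six columns described above and no header row
import pandas as pd
tssCheck = pd.read_csv('ensembl_tss_table.txt', sep='\t', header=None,
                       names=['gene', 'transcript', 'chromosome', 'TSS coordinate', 'strand', 'annotation source'])
print tssCheck.head()
```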
58 |
59 |
60 | ```python
61 | #Now create a TSS annotation by searching for P1 and P2 peaks near annotated TSSs
62 | geneToAliases = generateAliasDict(HGNC_SYMBOL_LOOKUP_TABLE,gencodeData)
63 | p1p2Table = generateTssTable_P1P2strategy(tssTable.loc[tssTable.apply(lambda row: row.name[0][:6] != 'pseudo',axis=1)],
64 | FANTOM_TSS_ANNOTATION_BED,
65 | matchedp1p2Window = 30000, #region around supplied TSS annotation to search for a FANTOM P1 or P2 peak that matches the gene name (or alias)
66 | anyp1p2Window = 500, #region around supplied TSS annotation to search for the nearest P1 or P2 peak
67 | anyPeakWindow = 200, #region around supplied TSS annotation to search for any CAGE peak
68 | minDistanceForTwoTSS = 1000, #If a P1 and P2 peak are found, maximum distance at which to combine into a single annotation (with primary/secondary TSS positions)
69 | aliasDict = geneToAliases[0])
70 | #the function will report some collisions of IDs due to use of aliases and redundancy in genome, but will resolve these itself
71 | ```
72 |
73 |
74 | ```python
75 | #after saving the tables for downstream use (e.g. with to_csv), reload them:
76 | tssTable = pd.read_csv(TSS_TABLE_PATH,sep='\t', index_col=range(2))
77 | p1p2Table = pd.read_csv(P1P2_TABLE_PATH,sep='\t', header=0, index_col=range(2))
78 | ```
79 |
80 | ## Calculate parameters for empirical sgRNAs
81 |
82 |
83 | ```python
84 | #Load bigwig files for any chromatin data of interest
85 | bwhandleDict = {'dnase':BigWigFile(open('ENCODE_data/wgEncodeOpenChromDnaseK562BaseOverlapSignalV2.bigWig')),
86 | 'faire':BigWigFile(open('ENCODE_data/wgEncodeOpenChromFaireK562Sig.bigWig')),
87 | 'mnase':BigWigFile(open('ENCODE_data/wgEncodeSydhNsomeK562Sig.bigWig'))}
88 | ```
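
Each handle can be spot-checked directly; a minimal sketch using bx-python's `get_as_array` (the coordinates are arbitrary hg19 positions chosen for illustration):

```python
#a minimal sketch: query 1kb of MNase-seq signal; bx-python returns a numpy array
#of per-base values, with NaN wherever the bigwig has no data
import numpy as np
mnaseSignal = bwhandleDict['mnase'].get_as_array('chr1', 1000000, 1001000)
print np.nanmean(mnaseSignal)
```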
89 |
90 |
91 | ```python
92 | paramTable_trainingGuides = generateTypicalParamTable(libraryTable_subset,sgInfoTable, tssTable, p1p2Table, genomeDict, bwhandleDict)
93 | ```
94 |
95 | ## Fit parameters
96 |
97 |
98 | ```python
99 | #populate table of fitting parameters
100 | typeList = ['binnable_onehot',
101 | 'continuous', 'continuous', 'continuous', 'continuous',
102 | 'binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot',
103 | 'binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot',
104 | 'binary']
105 | typeList.extend(['binary']*160)
106 | typeList.extend(['binary']*(16*38))
107 | typeList.extend(['binnable_onehot']*3)
108 | typeList.extend(['binnable_onehot']*2)
109 | typeList.extend(['binary']*18)
110 | fitTable = pd.DataFrame(typeList, index=paramTable_trainingGuides.columns, columns=['type'])
111 | fitparams =[{'bin width':1, 'min edge data':50, 'bin function':np.median},
112 | {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},
113 | {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},
114 | {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},
115 | {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},
116 | {'bin width':1, 'min edge data':50, 'bin function':np.median},
117 | {'bin width':1, 'min edge data':50, 'bin function':np.median},
118 | {'bin width':1, 'min edge data':50, 'bin function':np.median},
119 | {'bin width':1, 'min edge data':50, 'bin function':np.median},
120 | {'bin width':.1, 'min edge data':50, 'bin function':np.median},
121 | {'bin width':.1, 'min edge data':50, 'bin function':np.median},
122 | {'bin width':.1, 'min edge data':50, 'bin function':np.median},
123 | {'bin width':.1, 'min edge data':50, 'bin function':np.median},
124 | {'bin width':.1, 'min edge data':50, 'bin function':np.median},
125 | {'bin width':.1, 'min edge data':50, 'bin function':np.median},
126 | {'bin width':.1, 'min edge data':50, 'bin function':np.median},dict()]
127 | fitparams.extend([dict()]*160)
128 | fitparams.extend([dict()]*(16*38))
129 | fitparams.extend([
130 | {'bin width':.15, 'min edge data':50, 'bin function':np.median},
131 | {'bin width':.15, 'min edge data':50, 'bin function':np.median},
132 | {'bin width':.15, 'min edge data':50, 'bin function':np.median}])
133 | fitparams.extend([
134 | {'bin width':2, 'min edge data':50, 'bin function':np.median},
135 | {'bin width':2, 'min edge data':50, 'bin function':np.median}])
136 | fitparams.extend([dict()]*18)
137 | fitTable['params'] = fitparams
138 | ```
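
The `type`/`params` entries must line up one-to-one with the columns of `paramTable_trainingGuides`; a quick consistency check before fitting:

```python
#sanity check: one fitting specification per parameter column
print len(typeList) == len(fitparams) == len(paramTable_trainingGuides.columns)
fitTable.head()
```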
139 |
140 |
141 | ```python
142 | #divide empirical data into n-folds for cross-validation
143 | geneFoldList = getGeneFolds(libraryTable_subset, 5, transcripts=False)
144 | ```
145 |
146 |
147 | ```python
148 | #for each fold, fit parameters to training folds and measure ROC on test fold
149 | coefs = []
150 | scoreTups = []
151 | transformedParamTups = []
152 |
153 | for geneFold_train, geneFold_test in geneFoldList:
154 |
155 | transformedParams_train, estimators = fitParams(paramTable_trainingGuides.loc[normedScores.dropna().index].iloc[geneFold_train], normedScores.loc[normedScores.dropna().index].iloc[geneFold_train], fitTable)
156 |
157 | transformedParams_test = transformParams(paramTable_trainingGuides.loc[normedScores.dropna().index].iloc[geneFold_test], fitTable, estimators)
158 |
159 | reg = linear_model.ElasticNetCV(l1_ratio=[.5, .75, .9, .99,1], n_jobs=16, max_iter=2000)
160 |
161 | scaler = preprocessing.StandardScaler()
162 | reg.fit(scaler.fit_transform(transformedParams_train), normedScores.loc[normedScores.dropna().index].iloc[geneFold_train])
163 | predictedScores = pd.Series(reg.predict(scaler.transform(transformedParams_test)), index=transformedParams_test.index)
164 | testScores = normedScores.loc[normedScores.dropna().index].iloc[geneFold_test]
165 |
166 | transformedParamTups.append((scaler.transform(transformedParams_train),scaler.transform(transformedParams_test)))
167 | scoreTups.append((testScores, predictedScores))
168 |
169 | print 'Prediction AUC-ROC:', metrics.roc_auc_score((testScores >= .75).values, np.array(predictedScores.values,dtype='float64'))
170 | print 'Prediction R^2:', reg.score(scaler.transform(transformedParams_test), testScores)
171 | print 'Regression parameters:', reg.l1_ratio_, reg.alpha_
172 | coefs.append(pd.DataFrame(zip(*[abs(reg.coef_),reg.coef_]), index = transformedParams_test.columns, columns=['abs','true']))
173 | print 'Number of features used:', len(coefs[-1]) - sum(coefs[-1]['abs'] < .00000000001)
174 | ```
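
Since `scoreTups` accumulates the (measured, predicted) score pairs for each test fold, overall cross-validated performance can be summarized after the loop; a minimal sketch using the same >= .75 activity threshold as the per-fold printout above:

```python
#average AUC-ROC across the cross-validation folds
foldAucs = [metrics.roc_auc_score((test >= .75).values, np.array(pred.values, dtype='float64'))
            for test, pred in scoreTups]
print 'Mean AUC-ROC across folds:', np.mean(foldAucs)
```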
175 |
176 |
177 | ```python
178 | #can select an arbitrary fold (as shown here simply the last one tested) to save state for reproducing estimators later
179 | #the pickling of the scikit-learn estimators/regressors will allow the model to be reloaded for prediction of other guide designs,
180 | # but will not be compatible across scikit-learn versions, so it is important to preserve the training data and training/test folds
181 | import cPickle
182 | estimatorString = cPickle.dumps((fitTable, estimators, scaler, reg, (geneFold_train, geneFold_test)))
183 | with open(PICKLE_FILE,'w') as outfile:
184 | outfile.write(estimatorString)
185 |
186 | #also save the transformed parameters as these can slightly differ based on the automated binning strategy
187 | transformedParams_train.head().to_csv(TRANSFORMED_PARAM_HEADER,sep='\t')
188 | ```
189 |
190 | # 2. Applying machine learning model to predict sgRNA activity
191 |
192 |
193 | ```python
194 | #starting from a new session for demonstration purposes:
195 | %run sgRNA_learning.py
196 | import cPickle
197 |
198 | #load tssTable, p1p2Table, genome sequence, chromatin data
199 | tssTable = pd.read_csv(TSS_TABLE_PATH,sep='\t', index_col=range(2))
200 |
201 | p1p2Table = pd.read_csv(P1P2_TABLE_PATH,sep='\t', header=0, index_col=range(2))
202 | p1p2Table['primary TSS'] = p1p2Table['primary TSS'].apply(lambda s: tuple(int(x) for x in s.strip('()').split(', ')))
203 | p1p2Table['secondary TSS'] = p1p2Table['secondary TSS'].apply(lambda s: tuple(int(x) for x in s.strip('()').split(', ')))
204 |
205 | genomeDict = loadGenomeAsDict(FASTA_FILE_OF_GENOME)
206 |
207 | bwhandleDict = {'dnase':BigWigFile(open('ENCODE_data/wgEncodeOpenChromDnaseK562BaseOverlapSignalV2.bigWig')),
208 | 'faire':BigWigFile(open('ENCODE_data/wgEncodeOpenChromFaireK562Sig.bigWig')),
209 | 'mnase':BigWigFile(open('ENCODE_data/wgEncodeSydhNsomeK562Sig.bigWig'))}
210 |
211 | #load sgRNA prediction model saved after the parameter fitting step
212 | with open(PICKLE_FILE) as infile:
213 | fitTable, estimators, scaler, reg, (geneFold_train, geneFold_test) = cPickle.load(infile)
214 |
215 | transformedParamHeader = pd.read_csv(TRANSFORMED_PARAM_HEADER,sep='\t')
216 | ```
217 |
218 | ## Find all sgRNAs in genomic regions of interest
219 |
220 |
221 | ```python
222 | #use the same p1p2Table as above or generate a new one for novel TSSs; the (-25,500) tuple sets the window (bp relative to each TSS) searched for sgRNAs
223 | libraryTable_new, sgInfoTable_new = findAllGuides(p1p2Table, genomeDict, (-25,500))
224 | ```
225 |
226 |
227 | ```python
228 | #alternately, load tables of sgRNAs to score:
229 | libraryTable_new = pd.read_csv(LIBRARY_TABLE_PATH,sep='\t',index_col=0)
230 | sgInfoTable_new = pd.read_csv(SGINFO_TABLE_PATH,sep='\t',index_col=0)
231 | ```
232 |
233 | ## Predicting sgRNA activity
234 |
235 |
236 | ```python
237 | #calculate parameters for new sgRNAs
238 | paramTable_new = generateTypicalParamTable(libraryTable_new, sgInfoTable_new, tssTable, p1p2Table, genomeDict, bwhandleDict)
239 | ```
240 |
241 |
242 | ```python
243 | #transform and predict scores according to sgRNA prediction model
244 | transformedParams_new = transformParams(paramTable_new, fitTable, estimators)
245 |
246 | #reconcile any differences in column headers generated by automated binning
247 | colTups = []
248 | for (l1, l2), col in transformedParams_new.iteritems():
249 | colTups.append((l1,str(l2)))
250 | transformedParams_new.columns = pd.MultiIndex.from_tuples(colTups)
251 |
252 | predictedScores_new = pd.Series(reg.predict(scaler.transform(transformedParams_new.loc[:, transformedParamHeader.columns].fillna(0).values)), index=transformedParams_new.index)
253 | ```
254 |
255 |
256 | ```python
257 | predictedScores_new.to_csv(PREDICTED_SCORE_TABLE, sep='\t')
258 | ```
259 |
260 | # 3. Construct sgRNA libraries
261 | ## Score sgRNAs for off-target potential
262 |
263 |
264 | ```python
265 | #There are many ways to score sgRNAs for off-target potential; below is one method that is simple and flexible,
266 | #but it ignores gapped alignments and alternate PAMs, and relies on bowtie, which may not be maximally sensitive in all cases
267 | ```
268 |
269 |
270 | ```python
271 | #output all sequences to a temporary FASTQ file for running bowtie alignment
272 | def outputTempBowtieFastq(libraryTable, outputFileName):
273 | phredString = 'I4!=======44444+++++++' #weighting for how impactful mismatches are along sgRNA sequence
274 | with open(outputFileName,'w') as outfile:
275 | for name, row in libraryTable.iterrows():
276 | outfile.write('@' + name + '\n')
277 | outfile.write('CCN' + str(Seq.Seq(row['sequence'][1:]).reverse_complement()) + '\n')
278 | outfile.write('+\n')
279 | outfile.write(phredString + '\n')
280 |
281 | outputTempBowtieFastq(libraryTable_new, TEMP_FASTQ_FILE)
282 | ```
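
The fake quality string is what makes the `-e` thresholds in the next cell meaningful: in bowtie's Maq-like mode, `-e` caps the sum of the Phred quality values at all mismatched positions, so bases given a high "quality" are strongly penalized for mismatching while low-quality ones are tolerated. A quick decoding of the weights (standard Phred+33 arithmetic; note the `!` = 0 falls on the N of the CCN PAM, so any base is allowed there):

```python
#decode the per-position mismatch penalties encoded in the fake quality string;
#the first three values cover the CCN PAM written at the start of each read
print [ord(char) - 33 for char in 'I4!=======44444+++++++']
#-> [40, 19, 0, 28, 28, 28, 28, 28, 28, 28, 19, 19, 19, 19, 19, 10, 10, 10, 10, 10, 10, 10]
```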
283 |
284 |
285 | ```python
286 | import subprocess
287 | fqFile = TEMP_FASTQ_FILE
288 |
289 | #specifying a list of parameters to run bowtie with
290 | #each tuple contains
291 | # *the mismatch threshold below which a site is considered a potential off-target (higher is more stringent)
292 | # *the number of sites allowed (1 is minimum since each sgRNA should have one true site in genome)
293 | # *the genome index against which to align the sgRNA sequences; these can be custom built to only consider sites near TSSs
294 | # *a name for the bowtie run to create appropriately named output files
295 | alignmentList = [(39,1,'~/indices/hg19.ensemblTSSflank500b','39_nearTSS'),
296 | (31,1,'~/indices/hg19.ensemblTSSflank500b','31_nearTSS'),
297 | (21,1,'~/indices/hg19.maskChrMandPAR','21_genome'),
298 | (31,2,'~/indices/hg19.ensemblTSSflank500b','31_2_nearTSS'),
299 | (31,3,'~/indices/hg19.ensemblTSSflank500b','31_3_nearTSS')]
300 |
301 | alignmentColumns = []
302 | for btThreshold, mflag, bowtieIndex, runname in alignmentList:
303 |
304 | alignedFile = 'bowtie_output/' + runname + '_aligned.txt'
305 | unalignedFile = 'bowtie_output/' + runname + '_unaligned.fq'
306 | maxFile = 'bowtie_output/' + runname + '_max.fq'
307 |
308 | bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile
309 | print bowtieString
310 | print subprocess.call(bowtieString, shell=True)
311 |
312 | #parse through the file of sgRNAs that exceeded "m", the maximum allowable alignments, and mark "True" any that are found
313 | with open(maxFile) as infile:
314 | sgsAligning = set()
315 | for i, line in enumerate(infile):
316 | if i%4 == 0: #id line
317 | sgsAligning.add(line.strip()[1:])
318 |
319 | alignmentColumns.append(libraryTable_new.apply(lambda row: row.name in sgsAligning, axis=1))
320 |
321 | #collate results into a table, and flip the boolean values to yield the sgRNAs that passed filter as True
322 | alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3]).ne(True)
323 | ```
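
After the boolean flip, `True` means an sgRNA passed a given filter, so the stringency of the runs can be compared at a glance:

```python
#number of sgRNAs passing each off-target filter (columns are the run names above)
print alignmentTable.sum()
```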
324 |
325 | ## Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering
326 |
327 |
328 | ```python
329 | #combine all generated data into one master table
330 | predictedScores_new.name = 'predicted score'
331 | v2Table = pd.concat((libraryTable_new, predictedScores_new, alignmentTable, sgInfoTable_new), axis=1, keys=['library table v2', 'predicted score', 'off-target filters', 'sgRNA info'])
332 | ```
333 |
334 |
335 | ```python
336 | import re
337 | #for our pCRISPRi/a-v2 vector, we append flanking sequences to each sgRNA sequence for cloning and require the oligo to contain
338 | #exactly 1 BstXI and BlpI site each for cloning, and exactly 0 SbfI sites for sequencing sample preparation
339 | restrictionSites = {re.compile('CCA......TGG'):1,
340 | re.compile('GCT.AGC'):1,
341 | re.compile('CCTGCAGG'):0}
342 |
343 | def matchREsites(sequence, REdict):
344 | seq = sequence.upper()
345 |     for resite, numMatchesExpected in REdict.iteritems():
346 | if len(resite.findall(seq)) != numMatchesExpected:
347 | return False
348 |
349 | return True
350 |
351 | def checkOverlaps(leftPosition, acceptedLeftPositions, nonoverlapMin):
352 | for pos in acceptedLeftPositions:
353 | if abs(pos - leftPosition) < nonoverlapMin:
354 | return False
355 | return True
356 | ```
357 |
358 |
359 | ```python
360 | #flanking sequences
361 | upstreamConstant = 'CCACCTTGTTG'
362 | downstreamConstant = 'GTTTAAGAGCTAAGCTG'
363 |
364 | #minimum spacing (bp) between the left-aligned positions of any two accepted sgRNAs targeting the same TSS
365 | nonoverlapMin = 3
366 |
367 | #number of sgRNAs to pick per gene/TSS
368 | sgRNAsToPick = 10
369 |
370 | #list of off-target filter (or combinations of filters) levels, matching the names in the alignment table above
371 | offTargetLevels = [['31_nearTSS', '21_genome'],
372 | ['31_nearTSS'],
373 | ['21_genome'],
374 | ['31_2_nearTSS'],
375 | ['31_3_nearTSS']]
376 |
377 | #for each gene/TSS, go through each sgRNA in descending order of predicted score
378 | #if an sgRNA passes the restriction site, overlap, and off-target filters, accept it into the library
379 | #if the number of sgRNAs accepted is less than sgRNAsToPick, reduce off-target stringency by one and continue
380 | v2Groups = v2Table.groupby([('library table v2','gene'),('library table v2','transcripts')])
381 | newSgIds = []
382 | unfinishedTss = []
383 | for (gene, transcript), group in v2Groups:
384 | geneSgIds = []
385 | geneLeftPositions = []
386 | empiricalSgIds = dict()
387 |
388 | stringency = 0
389 |
390 | while len(geneSgIds) < sgRNAsToPick and stringency < len(offTargetLevels):
391 | for sgId_v2, row in group.sort(('predicted score','predicted score'), ascending=False).iterrows():
392 | oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant
393 | leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)
394 | if len(geneSgIds) < sgRNAsToPick and row['off-target filters'].loc[offTargetLevels[stringency]].all() \
395 | and matchREsites(oligoSeq, restrictionSites) \
396 | and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):
397 | geneSgIds.append((sgId_v2,
398 | gene,transcript,
399 | row[('library table v2','sequence')], oligoSeq,
400 | row[('predicted score','predicted score')], np.nan,
401 | stringency))
402 | geneLeftPositions.append(leftPos)
403 |
404 | stringency += 1
405 |
406 | if len(geneSgIds) < sgRNAsToPick:
407 | unfinishedTss.append((gene, transcript)) #if the number of accepted sgRNAs is still less than sgRNAsToPick, discard gene
408 | else:
409 | newSgIds.extend(geneSgIds)
410 |
411 | libraryTable_complete = pd.DataFrame(newSgIds, columns = ['sgID', 'gene', 'transcript','protospacer sequence', 'oligo sequence',
412 | 'predicted score', 'empirical score', 'off-target stringency']).set_index('sgID')
413 | ```
414 |
415 |
416 | ```python
417 | #number of sgRNAs accepted at each stringency level
418 | libraryTable_complete.groupby('off-target stringency').agg(len).iloc[:,0]
419 | ```
420 |
421 |
422 | ```python
423 | #number of TSSs with fewer than required number of sgRNAs (and thus not included in the library)
424 | print len(unfinishedTss)
425 | ```
426 |
427 |
428 | ```python
429 | #Note that empirical information from previous screens can be included as well--for example:
430 | geneToDisc = maxDiscriminantTable['best score'].groupby(level=0).agg(max).to_dict()
431 | thresh = 7
432 | empiricalBonus = .2
433 |
434 | upstreamConstant = 'CCACCTTGTTG'
435 | downstreamConstant = 'GTTTAAGAGCTAAGCTG'
436 |
437 | nonoverlapMin = 3
438 |
439 | sgRNAsToPick = 10
440 |
441 | offTargetLevels = [['31_nearTSS', '21_genome'],
442 | ['31_nearTSS'],
443 | ['21_genome'],
444 | ['31_2_nearTSS'],
445 | ['31_3_nearTSS']]
446 | offTargetLevels_v1 = [[s + '_v1' for s in l] for l in offTargetLevels]
447 |
448 | v1Groups = v1Table.groupby([('relative position','gene'),('relative position','transcript')])
449 | v2Groups = v2Table.groupby([('library table v2','gene'),('library table v2','transcripts')])
450 |
451 | newSgIds = []
452 | unfinishedTss = []
453 | for (gene, transcript), group in v2Groups:
454 | geneSgIds = []
455 | geneLeftPositions = []
456 | empiricalSgIds = dict()
457 |
458 | stringency = 0
459 |
460 | while len(geneSgIds) < sgRNAsToPick and stringency < len(offTargetLevels):
461 |
462 | if gene in geneToDisc and geneToDisc[gene] >= thresh and (gene, transcript) in v1Groups.groups:
463 |
464 | for sgId_v1, row in v1Groups.get_group((gene, transcript)).sort(('Empirical activity score','Empirical activity score'),ascending=False).iterrows():
465 | oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant
466 | leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)
467 | if len(geneSgIds) < sgRNAsToPick and min(abs(row.loc['relative position'].iloc[2:])) < 5000 \
468 | and row[('Empirical activity score','Empirical activity score')] >= .75 \
469 | and row['off-target filters'].loc[offTargetLevels_v1[stringency]].all() \
470 | and matchREsites(oligoSeq, restrictionSites) \
471 | and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):
472 | if len(geneSgIds) < 2:
473 | geneSgIds.append((row[('library table v2','sgId_v2')],
474 | gene,transcript,
475 | row[('library table v2','sequence')], oligoSeq,
476 | np.nan,row[('Empirical activity score','Empirical activity score')],
477 | stringency))
478 | geneLeftPositions.append(leftPos)
479 |
480 | empiricalSgIds[row[('library table v2','sgId_v2')]] = row[('Empirical activity score','Empirical activity score')]
481 |
482 | adjustedScores = group.apply(lambda row: row[('predicted score','CRISPRiv2 predicted score')] + empiricalBonus if row.name in empiricalSgIds else row[('predicted score','CRISPRiv2 predicted score')], axis=1)
483 | adjustedScores.name = ('adjusted score','')
484 | for sgId_v2, row in pd.concat((group,adjustedScores),axis=1).sort(('adjusted score',''), ascending=False).iterrows():
485 | oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant
486 | leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)
487 | if len(geneSgIds) < sgRNAsToPick and row['off-target filters'].loc[offTargetLevels[stringency]].all() \
488 | and matchREsites(oligoSeq, restrictionSites) \
489 | and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):
490 | geneSgIds.append((sgId_v2,
491 | gene,transcript,
492 | row[('library table v2','sequence')], oligoSeq,
493 | row[('predicted score','CRISPRiv2 predicted score')], empiricalSgIds[sgId_v2] if sgId_v2 in empiricalSgIds else np.nan,
494 | stringency))
495 | geneLeftPositions.append(leftPos)
496 |
497 | stringency += 1
498 |
499 | if len(geneSgIds) < sgRNAsToPick:
500 | unfinishedTss.append((gene, transcript))
501 | else:
502 | newSgIds.extend(geneSgIds)
503 |
504 |
505 | libraryTable_complete = pd.DataFrame(newSgIds, columns = ['sgID', 'gene', 'transcript','protospacer sequence', 'oligo sequence',
506 | 'predicted score', 'empirical score', 'off-target stringency']).set_index('sgID')
507 | ```
508 |
509 | ## Design negative controls matching the base composition of the library
510 |
511 |
512 | ```python
513 | #calculate the base frequency at each position of the sgRNA, then generate random sequences weighted by those frequencies
514 | def getBaseFrequencies(libraryTable, baseConversion = {'G':0, 'C':1, 'T':2, 'A':3}):
515 | baseArray = np.zeros((len(libraryTable),20))
516 |
517 | for i, (index, seq) in enumerate(libraryTable['protospacer sequence'].iteritems()):
518 | for j, char in enumerate(seq.upper()):
519 | baseArray[i,j] = baseConversion[char]
520 |
521 | baseTable = pd.DataFrame(baseArray, index = libraryTable.index)
522 |
523 | baseFrequencies = baseTable.apply(lambda col: col.groupby(col).agg(len)).fillna(0) / len(baseTable)
524 | baseFrequencies.index = ['G','C','T','A']
525 |
526 | baseCumulativeFrequencies = baseFrequencies.copy()
527 | baseCumulativeFrequencies.loc['C'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C']
528 | baseCumulativeFrequencies.loc['T'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T']
529 | baseCumulativeFrequencies.loc['A'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T'] + baseFrequencies.loc['A']
530 |
531 | return baseFrequencies, baseCumulativeFrequencies
532 |
533 | def generateRandomSequence(baseCumulativeFrequencies):
534 | randArray = np.random.random(baseCumulativeFrequencies.shape[1])
535 |
536 | seq = []
537 | for i, col in baseCumulativeFrequencies.iteritems():
538 | for base, freq in col.iteritems():
539 | if randArray[i] < freq:
540 | seq.append(base)
541 | break
542 |
543 | return ''.join(seq)
544 | ```
545 |
546 |
547 | ```python
548 | baseCumulativeFrequencies = getBaseFrequencies(libraryTable_complete)[1]
549 | negList = []
550 | for i in range(30000):
551 | negList.append(generateRandomSequence(baseCumulativeFrequencies))
552 | negTable = pd.DataFrame(negList, index=['non-targeting_' + str(i) for i in range(30000)], columns = ['sequence'])
553 |
554 | outputTempBowtieFastq(negTable, TEMP_FASTQ_FILE)
555 | ```
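
As a sanity check, the random sequences should roughly reproduce the library's positional base frequencies; `getBaseFrequencies` can be reused after renaming the sequence column to the name it expects:

```python
#compare positional base frequencies of the designed negatives to the library
#(showing the first five protospacer positions)
negFrequencies = getBaseFrequencies(negTable.rename(columns={'sequence':'protospacer sequence'}))[0]
print negFrequencies.iloc[:, :5]
```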
556 |
557 |
558 | ```python
559 | #similar to targeting sgRNA off-target scoring, but looking for sgRNAs with 0 alignments
560 | fqFile = TEMP_FASTQ_FILE
561 |
562 | alignmentList = [(31,1,'~/indices/hg19.ensemblTSSflank500b','31_nearTSS_negs'),
563 | (21,1,'~/indices/hg19.maskChrMandPAR','21_genome_negs')]
564 |
565 | alignmentColumns = []
566 | for btThreshold, mflag, bowtieIndex, runname in alignmentList:
567 |
568 | alignedFile = 'bowtie_output/' + runname + '_aligned.txt'
569 |     unalignedFile = 'bowtie_output/' + runname + '_unaligned.fq'
570 | maxFile = 'bowtie_output/' + runname + '_max.fq'
571 |
572 | bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile
573 | print bowtieString
574 | print subprocess.call(bowtieString, shell=True)
575 |
576 | #read unaligned file for negs, and then don't flip boolean of alignmentTable
577 | with open(unalignedFile) as infile:
578 | sgsAligning = set()
579 | for i, line in enumerate(infile):
580 | if i%4 == 0: #id line
581 | sgsAligning.add(line.strip()[1:])
582 |
583 | alignmentColumns.append(negTable.apply(lambda row: row.name in sgsAligning, axis=1))
584 |
585 | alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3])
586 | alignmentTable.head()
587 | ```
588 |
589 |
590 | ```python
591 | acceptedNegList = []
592 | negCount = 0
593 | for i, (name, row) in enumerate(pd.concat((negTable,alignmentTable),axis=1, keys=['seq','alignment']).iterrows()):
594 | oligo = upstreamConstant + row['seq','sequence'] + downstreamConstant
595 | if row['alignment'].all() and matchREsites(oligo, restrictionSites):
596 | acceptedNegList.append(('non-targeting_%05d' % negCount, 'negative_control', 'na', row['seq','sequence'], oligo, 0))
597 | negCount += 1
598 |
599 | acceptedNegs = pd.DataFrame(acceptedNegList, columns = ['sgId', 'gene', 'transcript', 'protospacer sequence', 'oligo sequence', 'off-target stringency']).set_index('sgId')
600 | ```
601 |
602 | ## Finalizing library design
603 |
604 | * divide genes into sublibrary groups (if required)
605 | * assign negative control sgRNAs to sublibrary groups; ~1-2% of the number of sgRNAs in the library is a good rule of thumb
606 | * append PCR adapter sequences (~18bp) to each end of the oligo sequences to enable amplification of the oligo pool; each sublibrary should have an orthogonal sequence so the sublibraries can be cloned separately (a minimal sketch follows below)
607 |
608 |
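A minimal sketch of the adapter step under stated assumptions: the adapter sequences and the single-sublibrary assignment below are hypothetical placeholders, not the published pCRISPRi/a-v2 primer sequences.

```python
from Bio import Seq

#hypothetical orthogonal 18bp PCR adapter pairs, one per sublibrary
sublibraryAdapters = {'sublibrary_A': ('ACGTACGTACGTACGTAC', 'TGCATGCATGCATGCATG')}

#combine targeting sgRNAs and accepted negative controls, then assign sublibraries
#(here, hypothetically, everything goes into a single sublibrary)
fullTable = pd.concat([libraryTable_complete, acceptedNegs])
fullTable['sublibrary'] = 'sublibrary_A'

def appendAdapters(row):
    #the 3' adapter is appended as the reverse complement of the reverse PCR primer
    fwdPrimer, revPrimer = sublibraryAdapters[row['sublibrary']]
    return fwdPrimer + row['oligo sequence'] + str(Seq.Seq(revPrimer).reverse_complement())

fullTable['final oligo'] = fullTable.apply(appendAdapters, axis=1)
```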
612 |
--------------------------------------------------------------------------------
/CRISPRiaDesign_example_notebook.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "1. Learning sgRNA predictors from empirical data\n",
8 | " * Load scripts and empirical data\n",
9 | " * Generate TSS annotation using FANTOM dataset\n",
10 | " * Calculate parameters for empirical sgRNAs\n",
11 | " * Fit parameters\n",
12 | "2. Applying machine learning model to predict sgRNA activity\n",
13 | " * Find all sgRNAs in genomic regions of interest \n",
14 | " * Predicting sgRNA activity\n",
15 | "3. Construct sgRNA libraries\n",
16 | " * Score sgRNAs for off-target potential\n",
17 | "    * Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering\n",
18 | "    * Design negative controls matching the base composition of the library\n",
19 | "    * Finalizing library design"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "# 1. Learning sgRNA predictors from empirical data\n",
27 | "## Load scripts and empirical data"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": null,
33 | "metadata": {
34 | "collapsed": true
35 | },
36 | "outputs": [],
37 | "source": [
38 | "%run sgRNA_learning.py"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": null,
44 | "metadata": {
45 | "collapsed": true
46 | },
47 | "outputs": [],
48 | "source": [
49 | "genomeDict = loadGenomeAsDict(FASTA_FILE_OF_GENOME)\n",
50 | "gencodeData = loadGencodeData(GTF_FILE_FROM_GENCODE)"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": null,
56 | "metadata": {
57 | "collapsed": true
58 | },
59 | "outputs": [],
60 | "source": [
61 | "#load empirical data as tables in the format generated by github.com/mhorlbeck/ScreenProcessing\n",
62 | "libraryTable, phenotypeTable, geneTable = loadExperimentData(PATHS_TO_DATA_GENERATED_BY_ScreenProcessing)"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": null,
68 | "metadata": {
69 | "collapsed": true
70 | },
71 | "outputs": [],
72 | "source": [
73 | "#extract genes that scored as hits, normalize phenotypes, and extract information on sgRNAs from the sgIDs\n",
74 | "discriminantTable = calculateDiscriminantScores(geneTable)\n",
75 | "normedScores, maxDiscriminantTable = getNormalizedsgRNAsOverThresh(libraryTable, phenotypeTable, discriminantTable, \n",
76 | "                                                                    DISCRIMINANT_THRESHOLD_eg20,\n",
77 | " 3, transcripts=False)\n",
78 | "\n",
79 | "libraryTable_subset = libraryTable.loc[normedScores.dropna().index]\n",
80 | "sgInfoTable = parseAllSgIds(libraryTable_subset)"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "## Generate TSS annotation using FANTOM dataset"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {
94 | "collapsed": true
95 | },
96 | "outputs": [],
97 | "source": [
98 | "#first generates a table of TSS annotations\n",
99 | "#legacy function to make an intermediate table for the \"P1P2\" annotation strategy, will be replaced in future versions\n",
100 | "#TSS_TABLE_BASED_ON_ENSEMBL is table without headers with columns:\n",
101 | "#gene, transcript, chromosome, TSS coordinate, strand, annotation_source(optional)\n",
102 | "tssTable = generateTssTable(geneTable, TSS_TABLE_BASED_ON_ENSEMBL, FANTOM_TSS_ANNOTATION_BED, 200)"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": null,
108 | "metadata": {
109 | "collapsed": true
110 | },
111 | "outputs": [],
112 | "source": [
113 | "#Now create a TSS annotation by searching for P1 and P2 peaks near annotated TSSs\n",
114 | "geneToAliases = generateAliasDict(HGNC_SYMBOL_LOOKUP_TABLE,gencodeData)\n",
115 | "p1p2Table = generateTssTable_P1P2strategy(tssTable.loc[tssTable.apply(lambda row: row.name[0][:6] != 'pseudo',axis=1)],\n",
116 | " FANTOM_TSS_ANNOTATION_BED, \n",
117 | " matchedp1p2Window = 30000, #region around supplied TSS annotation to search for a FANTOM P1 or P2 peak that matches the gene name (or alias)\n",
118 | " anyp1p2Window = 500, #region around supplied TSS annotation to search for the nearest P1 or P2 peak\n",
119 | " anyPeakWindow = 200, #region around supplied TSS annotation to search for any CAGE peak\n",
120 | " minDistanceForTwoTSS = 1000, #If a P1 and P2 peak are found, maximum distance at which to combine into a single annotation (with primary/secondary TSS positions)\n",
121 | " aliasDict = geneToAliases[0])\n",
122 | "#the function will report some collisions of IDs due to use of aliases and redundancy in genome, but will resolve these itself"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "metadata": {
129 | "collapsed": true
130 | },
131 | "outputs": [],
132 | "source": [
133 | "#after saving the tables for downstream use (e.g. with to_csv), reload them:\n",
134 | "tssTable = pd.read_csv(TSS_TABLE_PATH,sep='\\t', index_col=range(2))\n",
135 | "p1p2Table = pd.read_csv(P1P2_TABLE_PATH,sep='\\t', header=0, index_col=range(2))"
136 | ]
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "metadata": {
141 | "collapsed": true
142 | },
143 | "source": [
144 | "## Calculate parameters for empirical sgRNAs"
145 | ]
146 | },
147 | {
148 | "cell_type": "code",
149 | "execution_count": null,
150 | "metadata": {
151 | "collapsed": true
152 | },
153 | "outputs": [],
154 | "source": [
155 | "#Load bigwig files for any chromatin data of interest\n",
156 | "bwhandleDict = {'dnase':BigWigFile(open('ENCODE_data/wgEncodeOpenChromDnaseK562BaseOverlapSignalV2.bigWig')),\n",
157 | "'faire':BigWigFile(open('ENCODE_data/wgEncodeOpenChromFaireK562Sig.bigWig')),\n",
158 | "'mnase':BigWigFile(open('ENCODE_data/wgEncodeSydhNsomeK562Sig.bigWig'))}"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": null,
164 | "metadata": {
165 | "collapsed": true
166 | },
167 | "outputs": [],
168 | "source": [
169 | "paramTable_trainingGuides = generateTypicalParamTable(libraryTable_subset,sgInfoTable, tssTable, p1p2Table, genomeDict, bwhandleDict)"
170 | ]
171 | },
172 | {
173 | "cell_type": "markdown",
174 | "metadata": {},
175 | "source": [
176 | "## Fit parameters"
177 | ]
178 | },
179 | {
180 | "cell_type": "code",
181 | "execution_count": null,
182 | "metadata": {
183 | "collapsed": true
184 | },
185 | "outputs": [],
186 | "source": [
187 | "#populate table of fitting parameters\n",
188 | "typeList = ['binnable_onehot', \n",
189 | " 'continuous', 'continuous', 'continuous', 'continuous',\n",
190 | " 'binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot',\n",
191 | " 'binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot',\n",
192 | " 'binary']\n",
193 | "typeList.extend(['binary']*160)\n",
194 | "typeList.extend(['binary']*(16*38))\n",
195 | "typeList.extend(['binnable_onehot']*3)\n",
196 | "typeList.extend(['binnable_onehot']*2)\n",
197 | "typeList.extend(['binary']*18)\n",
198 | "fitTable = pd.DataFrame(typeList, index=paramTable_trainingGuides.columns, columns=['type'])\n",
199 | "fitparams =[{'bin width':1, 'min edge data':50, 'bin function':np.median},\n",
200 | " {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},\n",
201 | " {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},\n",
202 | " {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},\n",
203 | " {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},\n",
204 | " {'bin width':1, 'min edge data':50, 'bin function':np.median},\n",
205 | " {'bin width':1, 'min edge data':50, 'bin function':np.median},\n",
206 | " {'bin width':1, 'min edge data':50, 'bin function':np.median},\n",
207 | " {'bin width':1, 'min edge data':50, 'bin function':np.median},\n",
208 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n",
209 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n",
210 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n",
211 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n",
212 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n",
213 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n",
214 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},dict()]\n",
215 | "fitparams.extend([dict()]*160)\n",
216 | "fitparams.extend([dict()]*(16*38))\n",
217 | "fitparams.extend([\n",
218 | " {'bin width':.15, 'min edge data':50, 'bin function':np.median},\n",
219 | " {'bin width':.15, 'min edge data':50, 'bin function':np.median},\n",
220 | " {'bin width':.15, 'min edge data':50, 'bin function':np.median}])\n",
221 | "fitparams.extend([\n",
222 | " {'bin width':2, 'min edge data':50, 'bin function':np.median},\n",
223 | " {'bin width':2, 'min edge data':50, 'bin function':np.median}])\n",
224 | "fitparams.extend([dict()]*18)\n",
225 | "fitTable['params'] = fitparams"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "metadata": {
232 | "collapsed": true
233 | },
234 | "outputs": [],
235 | "source": [
236 | "#divide empirical data into n-folds for cross-validation\n",
237 | "geneFoldList = getGeneFolds(libraryTable_subset, 5, transcripts=False)"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": null,
243 | "metadata": {
244 | "collapsed": true
245 | },
246 | "outputs": [],
247 | "source": [
248 | "#for each fold, fit parameters to training folds and measure ROC on test fold\n",
249 | "coefs = []\n",
250 | "scoreTups = []\n",
251 | "transformedParamTups = []\n",
252 | "\n",
253 | "for geneFold_train, geneFold_test in geneFoldList:\n",
254 | "\n",
255 | " transformedParams_train, estimators = fitParams(paramTable_trainingGuides.loc[normedScores.dropna().index].iloc[geneFold_train], normedScores.loc[normedScores.dropna().index].iloc[geneFold_train], fitTable)\n",
256 | "\n",
257 | " transformedParams_test = transformParams(paramTable_trainingGuides.loc[normedScores.dropna().index].iloc[geneFold_test], fitTable, estimators)\n",
258 | " \n",
259 | " reg = linear_model.ElasticNetCV(l1_ratio=[.5, .75, .9, .99,1], n_jobs=16, max_iter=2000)\n",
260 | " \n",
261 | " scaler = preprocessing.StandardScaler()\n",
262 | " reg.fit(scaler.fit_transform(transformedParams_train), normedScores.loc[normedScores.dropna().index].iloc[geneFold_train])\n",
263 | " predictedScores = pd.Series(reg.predict(scaler.transform(transformedParams_test)), index=transformedParams_test.index)\n",
264 | " testScores = normedScores.loc[normedScores.dropna().index].iloc[geneFold_test]\n",
265 | " \n",
266 | " transformedParamTups.append((scaler.transform(transformedParams_train),scaler.transform(transformedParams_test)))\n",
267 | " scoreTups.append((testScores, predictedScores))\n",
268 | " \n",
269 | " print 'Prediction AUC-ROC:', metrics.roc_auc_score((testScores >= .75).values, np.array(predictedScores.values,dtype='float64'))\n",
270 | " print 'Prediction R^2:', reg.score(scaler.transform(transformedParams_test), testScores)\n",
271 | " print 'Regression parameters:', reg.l1_ratio_, reg.alpha_\n",
272 | " coefs.append(pd.DataFrame(zip(*[abs(reg.coef_),reg.coef_]), index = transformedParams_test.columns, columns=['abs','true']))\n",
273 | " print 'Number of features used:', len(coefs[-1]) - sum(coefs[-1]['abs'] < .00000000001)"
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": null,
279 | "metadata": {
280 | "collapsed": true
281 | },
282 | "outputs": [],
283 | "source": [
284 | "#can select an arbitrary fold (as shown here simply the last one tested) to save state for reproducing estimators later\n",
285 | "#the pickling of the scikit-learn estimators/regressors will allow the model to be reloaded for prediction of other guide designs, \n",
286 | "# but will not be compatible across scikit-learn versions, so it is important to preserve the training data and training/test folds\n",
287 | "import cPickle\n",
288 | "estimatorString = cPickle.dumps((fitTable, estimators, scaler, reg, (geneFold_train, geneFold_test)))\n",
289 | "with open(PICKLE_FILE,'w') as outfile:\n",
290 | " outfile.write(estimatorString)\n",
291 | " \n",
292 | "#also save the transformed parameters as these can slightly differ based on the automated binning strategy\n",
293 | "transformedParams_train.head().to_csv(TRANSFORMED_PARAM_HEADER,sep='\\t')"
294 | ]
295 | },
296 | {
297 | "cell_type": "markdown",
298 | "metadata": {
299 | "collapsed": true
300 | },
301 | "source": [
302 | "# 2. Applying machine learning model to predict sgRNA activity"
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "metadata": {
309 | "collapsed": true
310 | },
311 | "outputs": [],
312 | "source": [
313 | "#starting from a new session for demonstration purposes:\n",
314 | "%run sgRNA_learning.py\n",
315 | "import cPickle\n",
316 | "\n",
317 | "#load tssTable, p1p2Table, genome sequence, chromatin data\n",
318 | "tssTable = pd.read_csv(TSS_TABLE_PATH,sep='\\t', index_col=range(2))\n",
319 | "\n",
320 | "p1p2Table = pd.read_csv(P1P2_TABLE_PATH,sep='\\t', header=0, index_col=range(2))\n",
321 | "p1p2Table['primary TSS'] = p1p2Table['primary TSS'].apply(lambda s: tuple(int(x) for x in s.strip('()').split(', ')))\n",
322 | "p1p2Table['secondary TSS'] = p1p2Table['secondary TSS'].apply(lambda s: tuple(int(x) for x in s.strip('()').split(', ')))\n",
323 | "\n",
324 | "genomeDict = loadGenomeAsDict(FASTA_FILE_OF_GENOME)\n",
325 | "\n",
326 | "bwhandleDict = {'dnase':BigWigFile(open('ENCODE_data/wgEncodeOpenChromDnaseK562BaseOverlapSignalV2.bigWig')),\n",
327 | "'faire':BigWigFile(open('ENCODE_data/wgEncodeOpenChromFaireK562Sig.bigWig')),\n",
328 | "'mnase':BigWigFile(open('ENCODE_data/wgEncodeSydhNsomeK562Sig.bigWig'))}\n",
329 | "\n",
330 | "#load sgRNA prediction model saved after the parameter fitting step\n",
331 | "with open(PICKLE_FILE) as infile:\n",
332 | " fitTable, estimators, scaler, reg, (geneFold_train, geneFold_test) = cPickle.load(infile)\n",
333 | " \n",
334 | "transformedParamHeader = pd.read_csv(TRANSFORMED_PARAM_HEADER,sep='\\t')"
335 | ]
336 | },
337 | {
338 | "cell_type": "markdown",
339 | "metadata": {},
340 | "source": [
341 | "## Find all sgRNAs in genomic regions of interest "
342 | ]
343 | },
344 | {
345 | "cell_type": "code",
346 | "execution_count": null,
347 | "metadata": {
348 | "collapsed": true
349 | },
350 | "outputs": [],
351 | "source": [
352 | "#use the same p1p2Table as above or generate a new one for novel TSSs; the (-25,500) tuple sets the window (bp relative to each TSS) searched for sgRNAs\n",
353 | "libraryTable_new, sgInfoTable_new = findAllGuides(p1p2Table, genomeDict, (-25,500))"
354 | ]
355 | },
356 | {
357 | "cell_type": "code",
358 | "execution_count": null,
359 | "metadata": {
360 | "collapsed": true
361 | },
362 | "outputs": [],
363 | "source": [
364 | "#alternately, load tables of sgRNAs to score:\n",
365 | "libraryTable_new = pd.read_csv(LIBRARY_TABLE_PATH,sep='\\t',index_col=0)\n",
366 | "sgInfoTable_new = pd.read_csv(SGINFO_TABLE_PATH,sep='\\t',index_col=0)"
367 | ]
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "metadata": {},
372 | "source": [
373 | "## Predicting sgRNA activity"
374 | ]
375 | },
376 | {
377 | "cell_type": "code",
378 | "execution_count": null,
379 | "metadata": {
380 | "collapsed": true
381 | },
382 | "outputs": [],
383 | "source": [
384 | "#calculate parameters for new sgRNAs\n",
385 | "paramTable_new = generateTypicalParamTable(libraryTable_new, sgInfoTable_new, tssTable, p1p2Table, genomeDict, bwhandleDict)"
386 | ]
387 | },
388 | {
389 | "cell_type": "code",
390 | "execution_count": null,
391 | "metadata": {
392 | "collapsed": true
393 | },
394 | "outputs": [],
395 | "source": [
396 | "#transform and predict scores according to sgRNA prediction model\n",
397 | "transformedParams_new = transformParams(paramTable_new, fitTable, estimators)\n",
398 | "\n",
399 | "#reconcile any differences in column headers generated by automated binning\n",
400 | "colTups = []\n",
401 | "for (l1, l2), col in transformedParams_new.iteritems():\n",
402 | " colTups.append((l1,str(l2)))\n",
403 | "transformedParams_new.columns = pd.MultiIndex.from_tuples(colTups)\n",
404 | "\n",
405 | "predictedScores_new = pd.Series(reg.predict(scaler.transform(transformedParams_new.loc[:, transformedParamHeader.columns].fillna(0).values)), index=transformedParams_new.index)"
406 | ]
407 | },
408 | {
409 | "cell_type": "code",
410 | "execution_count": null,
411 | "metadata": {
412 | "collapsed": true
413 | },
414 | "outputs": [],
415 | "source": [
416 | "predictedScores_new.to_csv(PREDICTED_SCORE_TABLE, sep='\\t')"
417 | ]
418 | },
419 | {
420 | "cell_type": "markdown",
421 | "metadata": {},
422 | "source": [
423 | "# 3. Construct sgRNA libraries\n",
424 | "## Score sgRNAs for off-target potential"
425 | ]
426 | },
427 | {
428 | "cell_type": "code",
429 | "execution_count": null,
430 | "metadata": {
431 | "collapsed": true
432 | },
433 | "outputs": [],
434 | "source": [
435 | "#There are many ways to score sgRNAs for off-target potential; below is one method that is simple and flexible,\n",
436 | "#but it ignores gapped alignments and alternate PAMs, and relies on bowtie, which may not be maximally sensitive in all cases"
437 | ]
438 | },
439 | {
440 | "cell_type": "code",
441 | "execution_count": null,
442 | "metadata": {
443 | "collapsed": true
444 | },
445 | "outputs": [],
446 | "source": [
447 | "#output all sequences to a temporary FASTQ file for running bowtie alignment\n",
448 | "def outputTempBowtieFastq(libraryTable, outputFileName):\n",
449 | " phredString = 'I4!=======44444+++++++' #weighting for how impactful mismatches are along sgRNA sequence \n",
450 | " with open(outputFileName,'w') as outfile:\n",
451 | " for name, row in libraryTable.iterrows():\n",
452 | " outfile.write('@' + name + '\\n')\n",
453 | " outfile.write('CCN' + str(Seq.Seq(row['sequence'][1:]).reverse_complement()) + '\\n')\n",
454 | " outfile.write('+\\n')\n",
455 | " outfile.write(phredString + '\\n')\n",
456 | " \n",
457 | "outputTempBowtieFastq(libraryTable_new, TEMP_FASTQ_FILE)"
458 | ]
459 | },
460 | {
461 | "cell_type": "code",
462 | "execution_count": null,
463 | "metadata": {
464 | "collapsed": true
465 | },
466 | "outputs": [],
467 | "source": [
468 | "import subprocess\n",
469 | "fqFile = TEMP_FASTQ_FILE\n",
470 | "\n",
471 | "#specifying a list of parameters to run bowtie with\n",
472 | "#each tuple contains\n",
473 | "# *the mismatch threshold below which a site is considered a potential off-target (higher is more stringent)\n",
474 | "# *the number of sites allowed (1 is minimum since each sgRNA should have one true site in genome)\n",
475 | "# *the genome index against which to align the sgRNA sequences; these can be custom built to only consider sites near TSSs\n",
476 | "# *a name for the bowtie run to create appropriately named output files\n",
477 | "alignmentList = [(39,1,'~/indices/hg19.ensemblTSSflank500b','39_nearTSS'),\n",
478 | " (31,1,'~/indices/hg19.ensemblTSSflank500b','31_nearTSS'),\n",
479 | " (21,1,'~/indices/hg19.maskChrMandPAR','21_genome'),\n",
480 | " (31,2,'~/indices/hg19.ensemblTSSflank500b','31_2_nearTSS'),\n",
481 | " (31,3,'~/indices/hg19.ensemblTSSflank500b','31_3_nearTSS')]\n",
482 | "\n",
483 | "alignmentColumns = []\n",
484 | "for btThreshold, mflag, bowtieIndex, runname in alignmentList:\n",
485 | "\n",
486 | " alignedFile = 'bowtie_output/' + runname + '_aligned.txt'\n",
487 | " unalignedFile = 'bowtie_output/' + runname + '_unaligned.fq'\n",
488 | " maxFile = 'bowtie_output/' + runname + '_max.fq'\n",
489 | " \n",
490 | " bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile\n",
491 | " print bowtieString\n",
492 | " print subprocess.call(bowtieString, shell=True)\n",
493 | "\n",
494 | " #parse through the file of sgRNAs that exceeded \"m\", the maximum allowable alignments, and mark \"True\" any that are found\n",
495 | " with open(maxFile) as infile:\n",
496 | " sgsAligning = set()\n",
497 | " for i, line in enumerate(infile):\n",
498 | " if i%4 == 0: #id line\n",
499 | " sgsAligning.add(line.strip()[1:])\n",
500 | "\n",
501 | " alignmentColumns.append(libraryTable_new.apply(lambda row: row.name in sgsAligning, axis=1))\n",
502 | " \n",
503 | "#collate results into a table, and flip the boolean values to yield the sgRNAs that passed filter as True\n",
504 | "alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3]).ne(True)"
505 | ]
506 | },
507 | {
508 | "cell_type": "markdown",
509 | "metadata": {},
510 | "source": [
511 | "## Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering"
512 | ]
513 | },
514 | {
515 | "cell_type": "code",
516 | "execution_count": null,
517 | "metadata": {
518 | "collapsed": true
519 | },
520 | "outputs": [],
521 | "source": [
522 | "#combine all generated data into one master table\n",
523 | "predictedScores_new.name = 'predicted score'\n",
524 | "v2Table = pd.concat((libraryTable_new, predictedScores_new, alignmentTable, sgInfoTable_new), axis=1, keys=['library table v2', 'predicted score', 'off-target filters', 'sgRNA info'])"
525 | ]
526 | },
527 | {
528 | "cell_type": "code",
529 | "execution_count": null,
530 | "metadata": {
531 | "collapsed": true
532 | },
533 | "outputs": [],
534 | "source": [
535 | "import re\n",
536 | "#for our pCRISPRi/a-v2 vector, we append flanking sequences to each sgRNA sequence for cloning and require the oligo to contain\n",
537 | "#exactly 1 BstXI and BlpI site each for cloning, and exactly 0 SbfI sites for sequencing sample preparation\n",
538 | "restrictionSites = {re.compile('CCA......TGG'):1,\n",
539 | " re.compile('GCT.AGC'):1,\n",
540 | " re.compile('CCTGCAGG'):0}\n",
541 | "\n",
542 | "def matchREsites(sequence, REdict):\n",
543 | " seq = sequence.upper()\n",
544 | "    for resite, numMatchesExpected in REdict.iteritems():\n",
545 | " if len(resite.findall(seq)) != numMatchesExpected:\n",
546 | " return False\n",
547 | " \n",
548 | " return True\n",
549 | "\n",
550 | "def checkOverlaps(leftPosition, acceptedLeftPositions, nonoverlapMin):\n",
551 | " for pos in acceptedLeftPositions:\n",
552 | " if abs(pos - leftPosition) < nonoverlapMin:\n",
553 | " return False\n",
554 | " return True"
555 | ]
556 | },
557 | {
558 | "cell_type": "code",
559 | "execution_count": null,
560 | "metadata": {
561 | "collapsed": true
562 | },
563 | "outputs": [],
564 | "source": [
565 | "#flanking sequences\n",
566 | "upstreamConstant = 'CCACCTTGTTG'\n",
567 | "downstreamConstant = 'GTTTAAGAGCTAAGCTG'\n",
568 | "\n",
569 | "#minimum spacing (bp) between the left-aligned positions of any two accepted sgRNAs targeting the same TSS\n",
570 | "nonoverlapMin = 3\n",
571 | "\n",
572 | "#number of sgRNAs to pick per gene/TSS\n",
573 | "sgRNAsToPick = 10\n",
574 | "\n",
575 | "#list of off-target filter (or combinations of filters) levels, matching the names in the alignment table above\n",
576 | "offTargetLevels = [['31_nearTSS', '21_genome'],\n",
577 | " ['31_nearTSS'],\n",
578 | " ['21_genome'],\n",
579 | " ['31_2_nearTSS'],\n",
580 | " ['31_3_nearTSS']]\n",
581 | "\n",
582 | "#for each gene/TSS, go through each sgRNA in descending order of predicted score\n",
583 | "#if an sgRNA passes the restriction site, overlap, and off-target filters, accept it into the library\n",
584 | "#if the number of sgRNAs accepted is less than sgRNAsToPick, reduce off-target stringency by one and continue\n",
585 | "v2Groups = v2Table.groupby([('library table v2','gene'),('library table v2','transcripts')])\n",
586 | "newSgIds = []\n",
587 | "unfinishedTss = []\n",
588 | "for (gene, transcript), group in v2Groups:\n",
589 | " geneSgIds = []\n",
590 | " geneLeftPositions = []\n",
591 | " empiricalSgIds = dict()\n",
592 | " \n",
593 | " stringency = 0\n",
594 | " \n",
595 | " while len(geneSgIds) < sgRNAsToPick and stringency < len(offTargetLevels):\n",
596 | " for sgId_v2, row in group.sort(('predicted score','predicted score'), ascending=False).iterrows():\n",
597 | " oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant\n",
598 | " leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)\n",
599 | " if len(geneSgIds) < sgRNAsToPick and row['off-target filters'].loc[offTargetLevels[stringency]].all() \\\n",
600 | " and matchREsites(oligoSeq, restrictionSites) \\\n",
601 | " and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):\n",
602 | " geneSgIds.append((sgId_v2,\n",
603 | " gene,transcript,\n",
604 | " row[('library table v2','sequence')], oligoSeq,\n",
605 | " row[('predicted score','predicted score')], np.nan,\n",
606 | " stringency))\n",
607 | " geneLeftPositions.append(leftPos)\n",
608 | " \n",
609 | " stringency += 1\n",
610 | " \n",
611 | " if len(geneSgIds) < sgRNAsToPick:\n",
612 | " unfinishedTss.append((gene, transcript)) #if the number of accepted sgRNAs is still less than sgRNAsToPick, discard gene\n",
613 | " else:\n",
614 | " newSgIds.extend(geneSgIds)\n",
615 | " \n",
616 | "libraryTable_complete = pd.DataFrame(newSgIds, columns = ['sgID', 'gene', 'transcript','protospacer sequence', 'oligo sequence',\n",
617 | " 'predicted score', 'empirical score', 'off-target stringency']).set_index('sgID')"
618 | ]
619 | },
620 | {
621 | "cell_type": "code",
622 | "execution_count": null,
623 | "metadata": {
624 | "collapsed": true
625 | },
626 | "outputs": [],
627 | "source": [
628 | "#number of sgRNAs accepted at each stringency level\n",
629 | "newLibraryTable.groupby('off-target stringency').agg(len).iloc[:,0]"
630 | ]
631 | },
632 | {
633 | "cell_type": "code",
634 | "execution_count": null,
635 | "metadata": {
636 | "collapsed": true
637 | },
638 | "outputs": [],
639 | "source": [
640 | "#number of TSSs with fewer than required number of sgRNAs (and thus not included in the library)\n",
641 | "print len(unfinishedTss)"
642 | ]
643 | },
644 | {
645 | "cell_type": "code",
646 | "execution_count": null,
647 | "metadata": {
648 | "collapsed": true
649 | },
650 | "outputs": [],
651 | "source": [
652 | "#Note that empirical information from previous screens can be included as well--for example:\n",
653 | "geneToDisc = maxDiscriminantTable['best score'].groupby(level=0).agg(max).to_dict()\n",
654 | "thresh = 7\n",
655 | "empiricalBonus = .2\n",
656 | "\n",
657 | "upstreamConstant = 'CCACCTTGTTG'\n",
658 | "downstreamConstant = 'GTTTAAGAGCTAAGCTG'\n",
659 | "\n",
660 | "nonoverlapMin = 3\n",
661 | "\n",
662 | "sgRNAsToPick = 10\n",
663 | "\n",
664 | "offTargetLevels = [['31_nearTSS', '21_genome'],\n",
665 | " ['31_nearTSS'],\n",
666 | " ['21_genome'],\n",
667 | " ['31_2_nearTSS'],\n",
668 | " ['31_3_nearTSS']]\n",
669 | "offTargetLevels_v1 = [[s + '_v1' for s in l] for l in offTargetLevels]\n",
670 | "\n",
671 | "v1Groups = v1Table.groupby([('relative position','gene'),('relative position','transcript')])\n",
672 | "v2Groups = v2Table.groupby([('library table v2','gene'),('library table v2','transcripts')])\n",
673 | "\n",
674 | "newSgIds = []\n",
675 | "unfinishedTss = []\n",
676 | "for (gene, transcript), group in v2Groups:\n",
677 | " geneSgIds = []\n",
678 | " geneLeftPositions = []\n",
679 | " empiricalSgIds = dict()\n",
680 | " \n",
681 | " stringency = 0\n",
682 | " \n",
683 | " while len(geneSgIds) < sgRNAsToPick and stringency < len(offTargetLevels):\n",
684 | " \n",
685 | " if gene in geneToDisc and geneToDisc[gene] >= thresh and (gene, transcript) in v1Groups.groups:\n",
686 | "\n",
687 | " for sgId_v1, row in v1Groups.get_group((gene, transcript)).sort(('Empirical activity score','Empirical activity score'),ascending=False).iterrows():\n",
688 | " oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant\n",
689 | " leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)\n",
690 | " if len(geneSgIds) < sgRNAsToPick and min(abs(row.loc['relative position'].iloc[2:])) < 5000 \\\n",
691 | " and row[('Empirical activity score','Empirical activity score')] >= .75 \\\n",
692 | " and row['off-target filters'].loc[offTargetLevels_v1[stringency]].all() \\\n",
693 | " and matchREsites(oligoSeq, restrictionSites) \\\n",
694 | " and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):\n",
695 | " if len(geneSgIds) < 2:\n",
696 | " geneSgIds.append((row[('library table v2','sgId_v2')],\n",
697 | " gene,transcript,\n",
698 | " row[('library table v2','sequence')], oligoSeq,\n",
699 | " np.nan,row[('Empirical activity score','Empirical activity score')],\n",
700 | " stringency))\n",
701 | " geneLeftPositions.append(leftPos)\n",
702 | "\n",
703 | " empiricalSgIds[row[('library table v2','sgId_v2')]] = row[('Empirical activity score','Empirical activity score')]\n",
704 | "\n",
705 | " adjustedScores = group.apply(lambda row: row[('predicted score','CRISPRiv2 predicted score')] + empiricalBonus if row.name in empiricalSgIds else row[('predicted score','CRISPRiv2 predicted score')], axis=1)\n",
706 | " adjustedScores.name = ('adjusted score','')\n",
707 | " for sgId_v2, row in pd.concat((group,adjustedScores),axis=1).sort(('adjusted score',''), ascending=False).iterrows():\n",
708 | " oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant\n",
709 | " leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)\n",
710 | " if len(geneSgIds) < sgRNAsToPick and row['off-target filters'].loc[offTargetLevels[stringency]].all() \\\n",
711 | " and matchREsites(oligoSeq, restrictionSites) \\\n",
712 | " and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):\n",
713 | " geneSgIds.append((sgId_v2,\n",
714 | " gene,transcript,\n",
715 | " row[('library table v2','sequence')], oligoSeq,\n",
716 | " row[('predicted score','CRISPRiv2 predicted score')], empiricalSgIds[sgId_v2] if sgId_v2 in empiricalSgIds else np.nan,\n",
717 | " stringency))\n",
718 | " geneLeftPositions.append(leftPos)\n",
719 | " \n",
720 | " stringency += 1\n",
721 | " \n",
722 | " if len(geneSgIds) < sgRNAsToPick:\n",
723 | " unfinishedTss.append((gene, transcript))\n",
724 | " else:\n",
725 | " newSgIds.extend(geneSgIds)\n",
726 | "\n",
727 | " \n",
728 | "libraryTable_complete = pd.DataFrame(newSgIds, columns = ['sgID', 'gene', 'transcript','protospacer sequence', 'oligo sequence',\n",
729 | " 'predicted score', 'empirical score', 'off-target stringency']).set_index('sgID')"
730 | ]
731 | },
732 | {
733 | "cell_type": "markdown",
734 | "metadata": {},
735 | "source": [
736 | "## Design negative controls matching the base composition of the library"
737 | ]
738 | },
739 | {
740 | "cell_type": "code",
741 | "execution_count": null,
742 | "metadata": {
743 | "collapsed": true
744 | },
745 | "outputs": [],
746 | "source": [
747 | "#calcluate the base frequency at each position of the sgRNA, then generate random sequences weighted by this frequency\n",
748 | "def getBaseFrequencies(libraryTable, baseConversion = {'G':0, 'C':1, 'T':2, 'A':3}):\n",
749 | " baseArray = np.zeros((len(libraryTable),20))\n",
750 | "\n",
751 | " for i, (index, seq) in enumerate(libraryTable['protospacer sequence'].iteritems()):\n",
752 | " for j, char in enumerate(seq.upper()):\n",
753 | " baseArray[i,j] = baseConversion[char]\n",
754 | "\n",
755 | " baseTable = pd.DataFrame(baseArray, index = libraryTable.index)\n",
756 | " \n",
757 | " baseFrequencies = baseTable.apply(lambda col: col.groupby(col).agg(len)).fillna(0) / len(baseTable)\n",
758 | " baseFrequencies.index = ['G','C','T','A']\n",
759 | " \n",
760 | " baseCumulativeFrequencies = baseFrequencies.copy()\n",
761 | " baseCumulativeFrequencies.loc['C'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C']\n",
762 | " baseCumulativeFrequencies.loc['T'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T']\n",
763 | " baseCumulativeFrequencies.loc['A'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T'] + baseFrequencies.loc['A']\n",
764 | "\n",
765 | " return baseFrequencies, baseCumulativeFrequencies\n",
766 | "\n",
767 | "def generateRandomSequence(baseCumulativeFrequencies):\n",
768 | " randArray = np.random.random(baseCumulativeFrequencies.shape[1])\n",
769 | " \n",
770 | " seq = []\n",
771 | " for i, col in baseCumulativeFrequencies.iteritems():\n",
772 | " for base, freq in col.iteritems():\n",
773 | " if randArray[i] < freq:\n",
774 | " seq.append(base)\n",
775 | " break\n",
776 | " \n",
777 | " return ''.join(seq)"
778 | ]
779 | },
780 | {
781 | "cell_type": "code",
782 | "execution_count": null,
783 | "metadata": {
784 | "collapsed": true
785 | },
786 | "outputs": [],
787 | "source": [
788 | "baseCumulativeFrequencies = getBaseFrequencies(libraryTable_complete)[1]\n",
789 | "negList = []\n",
790 | "for i in range(30000):\n",
791 | " negList.append(generateRandomSequence(baseCumulativeFrequencies))\n",
792 | "negTable = pd.DataFrame(negList, index=['non-targeting_' + str(i) for i in range(30000)], columns = ['sequence'])\n",
793 | "\n",
794 | "outputTempBowtieFastq(negTable, TEMP_FASTQ_FILE)"
795 | ]
796 | },
797 | {
798 | "cell_type": "code",
799 | "execution_count": null,
800 | "metadata": {
801 | "collapsed": true
802 | },
803 | "outputs": [],
804 | "source": [
805 | "#similar to targeting sgRNA off-target scoring, but looking for sgRNAs with 0 alignments\n",
806 | "fqFile = TEMP_FASTQ_FILE\n",
807 | "\n",
808 | "alignmentList = [(31,1,'~/indices/hg19.ensemblTSSflank500b','31_nearTSS_negs'),\n",
809 | " (21,1,'~/indices/hg19.maskChrMandPAR','21_genome_negs')]\n",
810 | "\n",
811 | "alignmentColumns = []\n",
812 | "for btThreshold, mflag, bowtieIndex, runname in alignmentList:\n",
813 | "\n",
814 | " alignedFile = 'bowtie_output/' + runname + '_aligned.txt'\n",
815 | " unalignedFile = 'bowtie_output//' + runname + '_unaligned.fq'\n",
816 | " maxFile = 'bowtie_output/' + runname + '_max.fq'\n",
817 | " \n",
818 | " bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile\n",
819 | " print bowtieString\n",
820 | " print subprocess.call(bowtieString, shell=True)\n",
821 | "\n",
822 | " #read unaligned file for negs, and then don't flip boolean of alignmentTable\n",
823 | " with open(unalignedFile) as infile:\n",
824 | " sgsAligning = set()\n",
825 | " for i, line in enumerate(infile):\n",
826 | " if i%4 == 0: #id line\n",
827 | " sgsAligning.add(line.strip()[1:])\n",
828 | "\n",
829 | " alignmentColumns.append(negTable.apply(lambda row: row.name in sgsAligning, axis=1))\n",
830 | " \n",
831 | "alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3])\n",
832 | "alignmentTable.head()"
833 | ]
834 | },
835 | {
836 | "cell_type": "code",
837 | "execution_count": null,
838 | "metadata": {
839 | "collapsed": true
840 | },
841 | "outputs": [],
842 | "source": [
843 | "acceptedNegList = []\n",
844 | "negCount = 0\n",
845 | "for i, (name, row) in enumerate(pd.concat((negTable,alignmentTable),axis=1, keys=['seq','alignment']).iterrows()):\n",
846 | " oligo = upstreamConstant + row['seq','sequence'] + downstreamConstant\n",
847 | " if row['alignment'].all() and matchREsites(oligo, restrictionSites):\n",
848 | " acceptedNegList.append(('non-targeting_%05d' % negCount, 'negative_control', 'na', row['seq','sequence'], oligo, 0))\n",
849 | " negCount += 1\n",
850 | " \n",
851 | "acceptedNegs = pd.DataFrame(acceptedNegList, columns = ['sgId', 'gene', 'transcript', 'protospacer sequence', 'oligo sequence', 'off-target stringency']).set_index('sgId')"
852 | ]
853 | },
854 | {
855 | "cell_type": "markdown",
856 | "metadata": {},
857 | "source": [
858 | "## Finalizing library design"
859 | ]
860 | },
861 | {
862 | "cell_type": "markdown",
863 | "metadata": {},
864 | "source": [
865 | "* divide genes into sublibrary groups (if required)\n",
866 | "* assign negative control sgRNAs to sublibrary groups; ~1-2% of the number of sgRNAs in the library is a good rule-of-thumb\n",
867 | "* append PCR adapter sequences (~18bp) to each end of the oligo sequences to enable amplification of the oligo pool; each sublibary should have an orthogonal sequence so they can be cloned separately"
868 | ]
869 | },
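{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of these finalization steps. The sublibrary grouping, negative control fraction, and PCR adapter sequences are hypothetical placeholders for illustration only, not the published v2 designs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#hypothetical sublibrary grouping and orthogonal ~18bp PCR adapter pairs (placeholders)\n",
"sublibraries = {'sublibrary_1': ['GENE1', 'GENE2']}\n",
"pcrAdapters = {'sublibrary_1': ('GTAACGGGTCAGAGCACC', 'GTCGAGAGCAGTCCTTCG')}\n",
"\n",
"finalOligos = []\n",
"negOffset = 0\n",
"for sublibName, genes in sublibraries.iteritems():\n",
"    sublibTable = libraryTable_complete.loc[libraryTable_complete['gene'].isin(genes)]\n",
"\n",
"    #assign ~1.5% negative controls, drawing a distinct slice of acceptedNegs for each sublibrary\n",
"    numNegs = int(round(len(sublibTable) * .015))\n",
"    sublibTable = pd.concat((sublibTable, acceptedNegs.iloc[negOffset:negOffset + numNegs]))\n",
"    negOffset += numNegs\n",
"\n",
"    #append the sublibrary-specific adapters to each end of the cloning oligo\n",
"    fwdAdapter, revAdapter = pcrAdapters[sublibName]\n",
"    for sgId, row in sublibTable.iterrows():\n",
"        finalOligos.append((sgId, sublibName, fwdAdapter + row['oligo sequence'] + revAdapter))\n",
"\n",
"finalOligoTable = pd.DataFrame(finalOligos, columns=['sgId', 'sublibrary', 'full oligo sequence']).set_index('sgId')"
]
},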
870 | {
871 | "cell_type": "code",
872 | "execution_count": null,
873 | "metadata": {
874 | "collapsed": true
875 | },
876 | "outputs": [],
877 | "source": []
878 | }
879 | ],
880 | "metadata": {
881 | "kernelspec": {
882 | "display_name": "Python 2",
883 | "language": "python",
884 | "name": "python2"
885 | },
886 | "language_info": {
887 | "codemirror_mode": {
888 | "name": "ipython",
889 | "version": 2
890 | },
891 | "file_extension": ".py",
892 | "mimetype": "text/x-python",
893 | "name": "python",
894 | "nbconvert_exporter": "python",
895 | "pygments_lexer": "ipython2",
896 | "version": "2.7.3"
897 | }
898 | },
899 | "nbformat": 4,
900 | "nbformat_minor": 0
901 | }
902 |
--------------------------------------------------------------------------------
/sgRNA_learning.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import subprocess
4 | import tempfile
5 | import multiprocessing
6 | import numpy as np
7 | import scipy as sp
8 | import pandas as pd
9 | from ConfigParser import SafeConfigParser
10 | from Bio import Seq, SeqIO
11 | import pysam
12 | from bx.bbi.bigwig_file import BigWigFile
13 | from sklearn import linear_model, svm, ensemble, preprocessing, grid_search, metrics
14 |
15 | from expt_config_parser import parseExptConfig, parseLibraryConfig
16 |
17 | ###############################################################################
18 | # Import and Merge Training/Test Data #
19 | ###############################################################################
20 | def loadExperimentData(experimentFile, supportedLibraryPath, library, basePath = '.'):
21 | libDict, librariesToTables = parseLibraryConfig(os.path.join(supportedLibraryPath, 'library_config.txt'))
22 |
23 | geneTableDict = dict()
24 | phenotypeTableDict = dict()
25 | libraryTableDict = dict()
26 |
27 | parser = SafeConfigParser()
28 | parser.read(experimentFile)
29 | for exptConfigFile in parser.sections():
30 | configDict = parseExptConfig(exptConfigFile,libDict)[0]
31 |
32 | libraryTable = pd.read_csv(os.path.join(basePath,configDict['output_folder'],configDict['experiment_name']) + '_librarytable.txt',
33 | sep='\t', index_col=range(1), header=0)
34 | libraryTableDict[configDict['experiment_name']] = libraryTable
35 |
36 | geneTable = pd.read_csv(os.path.join(basePath,configDict['output_folder'],configDict['experiment_name']) + '_genetable.txt',
37 | sep='\t',index_col=range(2),header=range(3))
38 | phenotypeTable = pd.read_csv(os.path.join(basePath,configDict['output_folder'],configDict['experiment_name']) + '_phenotypetable.txt',\
39 | sep='\t',index_col=range(1),header=range(2))
40 |
41 | condTups = [(condStr.split(':')[0],condStr.split(':')[1]) for condStr in parser.get(exptConfigFile, 'condition_tuples').strip().split('\n')]
42 | # print condTups
43 |
44 | geneTableDict[configDict['experiment_name']] = geneTable.loc[:,[level_name for level_name in geneTable.columns if (level_name[0],level_name[1]) in condTups]]
45 | phenotypeTableDict[configDict['experiment_name']] = phenotypeTable.loc[:,[level_name for level_name in phenotypeTable.columns if (level_name[0],level_name[1]) in condTups]]
46 |
47 | mergedLibraryTable = pd.concat(libraryTableDict.values())
48 | # print mergedLibraryTable.head()
49 | mergedLibraryTable_dedup = mergedLibraryTable.drop_duplicates(['gene','sequence'])
50 | # print mergedLibraryTable_dedup.head()
51 | mergedGeneTable = pd.concat(geneTableDict.values(), keys=geneTableDict.keys(), axis = 1)
52 | # print mergedGeneTable.head()
53 | mergedPhenotypeTable = pd.concat(phenotypeTableDict.values(), keys=phenotypeTableDict.keys(), axis = 1)
54 | # print mergedPhenotypeTable.head()
55 | mergedPhenotypeTable_dedup = mergedPhenotypeTable.loc[mergedLibraryTable_dedup.index]
56 |
57 | return mergedLibraryTable_dedup, mergedPhenotypeTable_dedup, mergedGeneTable
58 |
59 | def calculateDiscriminantScores(geneTable, effectSize = 'average phenotype of strongest 3', pValue = 'Mann-Whitney p-value'):
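    #scale each gene's effect size by the standard deviation of the negative-control
    #('pseudo') gene distribution and multiply by -log10(p-value), yielding one
    #discriminant score per gene per screen condition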
60 | isPseudo = getPseudoIndices(geneTable)
61 | geneTable_reordered = geneTable.reorder_levels((3,0,1,2), axis=1)
62 | zscores = geneTable_reordered[effectSize] / geneTable_reordered.loc[isPseudo,effectSize].std()
63 | pvals = -1 * np.log10(geneTable_reordered[pValue])
64 |
65 | seriesDict = dict()
66 | for group, table in pd.concat((zscores, pvals), keys=(effectSize,pValue),axis=1).reorder_levels((1,2,3,0), axis=1).groupby(level=range(3),axis=1):
67 | # print table.head()
68 | seriesDict[group] = table[group].apply(lambda row: row[effectSize] * row[pValue], axis=1)
69 |
70 | return pd.DataFrame(seriesDict)
71 |
72 | def getNormalizedsgRNAsOverThresh(libraryTable, phenotypeTable, discriminantTable, threshold, numToNormalize, transcripts=True):
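    #for each gene (or gene/transcript pair) whose best discriminant score passes the
    #threshold, take the sgRNA phenotypes from its most active screen condition and
    #divide by the mean of the numToNormalize strongest phenotypes, putting sgRNA
    #activities for different genes on a comparable scale for training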
73 | maxDiscriminants = pd.concat([discriminantTable.abs().idxmax(axis=1), discriminantTable.abs().max(axis=1)], keys = ('best col','best score'), axis=1)
74 |
75 | if transcripts:
76 | grouper = (libraryTable['gene'],libraryTable['transcripts'])
77 | else:
78 | grouper = libraryTable['gene']
79 |
80 | normedPhenotypes = []
81 | for name, group in phenotypeTable.groupby(grouper):
82 | if (transcripts and name[0] == 'negative_control') or (not transcripts and name == 'negative_control'):
83 | continue
84 | maxDisc = maxDiscriminants.loc[name]
85 |
86 | if not transcripts:
87 | maxDisc = maxDisc.sort('best score').iloc[-1]
88 |
89 | if maxDisc['best score'] >= threshold:
90 | bestGroup = group[maxDisc['best col']]
91 | normedPhenotypes.append(bestGroup / np.mean(sorted(bestGroup.dropna(), key=abs, reverse=True)[:numToNormalize]))
92 |
93 | return pd.concat(normedPhenotypes), maxDiscriminants
94 |
95 | def getGeneFolds(libraryTable, kfold, transcripts=True):
96 | if transcripts:
97 | geneGroups = pd.Series(range(len(libraryTable)), index=libraryTable.index).groupby((libraryTable['gene'],libraryTable['transcripts']))
98 | else:
99 | geneGroups = pd.Series(range(len(libraryTable)), index=libraryTable.index).groupby(libraryTable['gene'])
100 |
101 | idxList = np.arange(geneGroups.ngroups)
102 | np.random.shuffle(idxList)
103 |
104 | foldsize = int(np.floor(geneGroups.ngroups * 1.0 / kfold))
105 | folds = []
106 | for i in range(kfold):
107 | testGroups = []
108 | trainGroups = []
109 | testSet = set(idxList[i * foldsize: (i+1) * foldsize])
110 |         for j, (name, group) in enumerate(geneGroups):
111 |             if j in testSet:
112 | testGroups.extend(group.values)
113 | else:
114 | trainGroups.extend(group.values)
115 | folds.append((trainGroups,testGroups))
116 |
117 | return folds
118 |
119 |
120 | ###############################################################################
121 | # Calculate sgRNA Parameters #
122 | ###############################################################################
123 | #TSS annotations rely on the library input TSSs; may want to convert to GENCODE in the future
124 | def generateTssTable(geneTable, libraryTssFile, cagePeakFile, cageWindow, aliasDict = {'NFIK':'MKI67IP'}):
125 | codingTssList = []
126 | with open(libraryTssFile) as infile:
127 | for line in infile:
128 | linesplit = line.strip().split('\t')
129 | try:
130 | chrom = int(linesplit[2][3:])
131 | except ValueError:
132 | chrom = linesplit[2][3:]
133 | codingTssList.append((chrom, int(linesplit[3]), linesplit[0], linesplit[1], linesplit[2], linesplit[3], linesplit[4]))
134 |
135 | codingTupDict = {(tup[2],tup[3]):tup for tup in codingTssList}
136 |
137 | codingGeneToTransList = dict()
138 |
139 | for geneTrans in codingTupDict:
140 | if geneTrans[0] not in codingGeneToTransList:
141 | codingGeneToTransList[geneTrans[0]] = []
142 |
143 | codingGeneToTransList[geneTrans[0]].append(geneTrans[1])
144 |
145 | positionList = []
146 | for (gene,transcriptList), row in geneTable.iterrows():
147 |
148 | if gene not in codingGeneToTransList: #only pseudogenes
149 | positionList.append((np.nan,np.nan,np.nan))
150 | continue
151 |
152 | if transcriptList == 'all':
153 | positions = [codingTupDict[(gene, trans)][1] for trans in codingGeneToTransList[gene]]
154 | else:
155 | positions = [codingTupDict[(gene, trans)][1] for trans in transcriptList.split(',')]
156 |
157 |         positionList.append((np.mean(positions), codingTupDict[(gene,trans)][6], codingTupDict[(gene,trans)][4])) #strand/chromosome taken from the last transcript iterated above (shared across a gene's transcripts)
158 |
159 | tssPositionTable = pd.DataFrame(positionList, index=geneTable.index, columns=['position', 'strand','chromosome'])
160 |
161 | cagePeaks = pysam.Tabixfile(cagePeakFile)
162 | halfwindow = cageWindow
163 | strictColor = '60,179,113'
164 | relaxedColor = '30,144,255'
165 |
166 | cagePeakRanges = []
167 | for i, (gt, tssRow) in enumerate(tssPositionTable.dropna().iterrows()):
168 | peaks = cagePeaks.fetch(tssRow['chromosome'],tssRow['position'] - halfwindow,tssRow['position'] + halfwindow, parser=pysam.asBed())
169 |
170 | ranges = []
171 | relaxedRanges = []
172 | for peak in peaks:
173 | # print peak
174 | if peak.strand == tssRow['strand'] and peak.itemRGB == strictColor:
175 | ranges.append((peak.start, peak.end))
176 | elif peak.strand == tssRow['strand'] and peak.itemRGB == relaxedColor:
177 | relaxedRanges.append((peak.start, peak.end))
178 |
179 | if len(ranges) > 0:
180 | cagePeakRanges.append(ranges)
181 | else:
182 | cagePeakRanges.append(relaxedRanges)
183 |
184 | cageSeries = pd.Series(cagePeakRanges, index = tssPositionTable.dropna().index)
185 |
186 | tssPositionTable_cage = pd.concat([tssPositionTable, cageSeries], axis=1)
187 | tssPositionTable_cage.columns = ['position', 'strand','chromosome','cage peak ranges']
188 | return tssPositionTable_cage
189 |
190 | # for (gene, transList), row in geneTable.iterrows():
191 | # if gene not in gencodeData and gene in aliasDict:
192 | # geneData = gencodeData[aliasDict[gene]]
193 | # else:
194 | # geneData = gencodeData[gene]
195 |
196 | def generateTssTable_P1P2strategy(tssTable, cagePeakFile, matchedp1p2Window, anyp1p2Window, anyPeakWindow, minDistanceForTwoTSS, aliasDict):
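    #TSS assignment proceeds in tiers: (1) CAGE peaks whose names match p1/p2 for this
    #gene or one of its aliases within matchedp1p2Window; (2) any p1/p2-named peaks
    #within anyp1p2Window; (3) robust, then permissive, CAGE peaks within anyPeakWindow;
    #(4) fall back to the annotated TSS position. P1 and P2 are reported as a single
    #'P1P2' TSS when they lie within minDistanceForTwoTSS of each other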
197 | cagePeaks = pysam.Tabixfile(cagePeakFile)
198 | strictColor = '60,179,113'
199 | relaxedColor = '30,144,255'
200 |
201 | resultRows = []
202 | for gene, tssRowGroup in tssTable.groupby(level=0):
203 |
204 | if len(set(tssRowGroup['chromosome'].values)) == 1:
205 | chrom = tssRowGroup['chromosome'].values[0]
206 | else:
207 |             raise ValueError('multiple annotated chromosomes for ' + gene)
208 |
209 | if len(set(tssRowGroup['strand'].values)) == 1:
210 | strand = tssRowGroup['strand'].values[0]
211 | else:
212 |             raise ValueError('multiple annotated strands for ' + gene)
213 |
214 | #try to match P1/P2 names within the window
215 | # peaks = cagePeaks.fetch(chrom,max(0,tssRowGroup['position'].min() - matchedp1p2Window),tssRowGroup['position'].max() + matchedp1p2Window, parser=pysam.asBed())
216 | peaks = []
217 | for transcript, row in tssRowGroup.iterrows():
218 | peaks.extend([p for p in cagePeaks.fetch(chrom,max(0,row['position'] - matchedp1p2Window),row['position'] + matchedp1p2Window, parser=pysam.asBed())])
219 | p1Matches = set()
220 | p2Matches = set()
221 | for peak in peaks:
222 | if peak.strand == strand and matchPeakName(peak.name, aliasDict[gene] if gene in aliasDict else [gene], 'p1'):
223 | p1Matches.add((peak.start,peak.end))
224 | elif peak.strand == strand and matchPeakName(peak.name, aliasDict[gene] if gene in aliasDict else [gene], 'p2') and peak.itemRGB == strictColor:
225 | p2Matches.add((peak.start,peak.end))
226 | p1Matches = list(p1Matches)
227 | p2Matches = list(p2Matches)
228 |
229 | if len(p1Matches) >= 1:
230 | if len(p1Matches) > 1:
231 | print 'multiple matched p1:', gene, p1Matches, p2Matches #rare event, typically a doubly-named TSS, basically at the same spot
232 |
233 | closestMatch = p1Matches[0]
234 | for match in p1Matches:
235 | if min(abs(match[0] - tssRowGroup['position'])) < min(abs(closestMatch[0] - tssRowGroup['position'])):
236 | closestMatch = match
237 | p1Matches = [closestMatch]
238 |
239 | if len(p2Matches) > 1:
240 | print 'multiple matched p2:', gene, p1Matches, p2Matches
241 |
242 | closestMatch = p2Matches[0]
243 | for match in p2Matches:
244 | if min(abs(match[0] - tssRowGroup['position'])) < min(abs(closestMatch[0] - tssRowGroup['position'])):
245 | closestMatch = match
246 | p2Matches = [closestMatch]
247 |
248 | if len(p2Matches) == 0 or abs(p1Matches[0][0] - p2Matches[0][0]) <= minDistanceForTwoTSS:
249 | resultRows.append((gene,'P1P2', chrom, strand, 'CAGE, matched peaks', p1Matches[0], p2Matches[0] if len(p2Matches) > 0 else p1Matches[0]))
250 | else:
251 | resultRows.append((gene,'P1', chrom, strand, 'CAGE, matched peaks', p1Matches[0], p1Matches[0]))
252 | resultRows.append((gene,'P2', chrom, strand, 'CAGE, matched peaks', p2Matches[0], p2Matches[0]))
253 |
254 |
255 | #try to match any P1/P2 names
256 | else:
257 | peaks = []
258 | for transcript, row in tssRowGroup.iterrows():
259 | peaks.extend([p for p in cagePeaks.fetch(chrom,max(0,row['position'] - anyp1p2Window),row['position'] + anyp1p2Window, parser=pysam.asBed())])
260 | p1Matches = set()
261 | p2Matches = set()
262 | for peak in peaks:
263 | if peak.strand == strand and peak.name.find('p1@') != -1:
264 | p1Matches.add((peak.start,peak.end))
265 | elif peak.strand == strand and peak.name.find('p2@') != -1 and peak.itemRGB == strictColor:
266 | p2Matches.add((peak.start,peak.end))
267 | p1Matches = list(p1Matches)
268 | p2Matches = list(p2Matches)
269 |
270 | if len(p1Matches) >=1:
271 | if len(p1Matches) > 1:
272 | print 'multiple nearby p1:', gene, p1Matches, p2Matches
273 |
274 | closestMatch = p1Matches[0]
275 | for match in p1Matches:
276 | if min(abs(match[0] - tssRowGroup['position'])) < min(abs(closestMatch[0] - tssRowGroup['position'])):
277 | closestMatch = match
278 | p1Matches = [closestMatch]
279 |
280 | if len(p2Matches) > 1:
281 | print 'multiple nearby p2:', gene, p1Matches, p2Matches
282 |
283 | closestMatch = p2Matches[0]
284 | for match in p2Matches:
285 | if min(abs(match[0] - tssRowGroup['position'])) < min(abs(closestMatch[0] - tssRowGroup['position'])):
286 | closestMatch = match
287 | p2Matches = [closestMatch]
288 |
289 | if len(p2Matches) == 0 or abs(p1Matches[0][0] - p2Matches[0][0]) <= minDistanceForTwoTSS:
290 | resultRows.append((gene,'P1P2', chrom, strand, 'CAGE, primary peaks', p1Matches[0], p2Matches[0] if len(p2Matches) > 0 else p1Matches[0]))
291 | else:
292 | resultRows.append((gene,'P1', chrom, strand, 'CAGE, primary peaks', p1Matches[0], p1Matches[0]))
293 | resultRows.append((gene,'P2', chrom, strand, 'CAGE, primary peaks', p2Matches[0], p2Matches[0]))
294 |
295 |
296 | #try to match robust or permissive peaks
297 | else:
298 | for transcript, row in tssRowGroup.iterrows():
299 | peaks = cagePeaks.fetch(chrom,max(0,row['position']) - anyPeakWindow,row['position'] + anyPeakWindow, parser=pysam.asBed())
300 | robustPeaks = []
301 | permissivePeaks = []
302 | for peak in peaks:
303 | if peak.strand == strand and peak.itemRGB == strictColor:
304 | robustPeaks.append((peak.start,peak.end))
305 | if peak.strand == strand and peak.itemRGB == relaxedColor:
306 | permissivePeaks.append((peak.start,peak.end))
307 |
308 | if len(robustPeaks) >= 1:
309 | if strand == '+':
310 | resultRows.append((gene,transcript[1], chrom, strand, 'CAGE, robust peak', robustPeaks[0], robustPeaks[-1]))
311 | else:
312 | resultRows.append((gene,transcript[1], chrom, strand, 'CAGE, robust peak', robustPeaks[-1], robustPeaks[0]))
313 | elif len(permissivePeaks) >= 1:
314 | if strand == '+':
315 | resultRows.append((gene,transcript[1], chrom, strand, 'CAGE permissive peak', permissivePeaks[0], permissivePeaks[-1]))
316 | else:
317 | resultRows.append((gene,transcript[1], chrom, strand, 'CAGE permissive peak', permissivePeaks[-1], permissivePeaks[0]))
318 | else:
319 | resultRows.append((gene, transcript[1], chrom, strand, 'Annotation', (row['position'],row['position']), (row['position'],row['position'])))
320 |
321 | return pd.DataFrame(resultRows, columns=['gene','transcript','chromosome','strand','TSS source','primary TSS','secondary TSS']).set_index(keys=['gene','transcript'])
322 |
323 | def generateSgrnaDistanceTable_p1p2Strategy(sgInfoTable, libraryTable, p1p2Table, transcripts=False):
324 | sgDistanceSeries = []
325 |
326 | if transcripts == False: # when sgRNAs weren't designed based on the p1p2 strategy
327 | for name, group in sgInfoTable['pam coordinate'].groupby(libraryTable['gene']):
328 | if name in p1p2Table.index:
329 | tssRow = p1p2Table.loc[name]
330 |
331 | if len(tssRow) == 1:
332 | tssRow = tssRow.iloc[0]
333 | for sgId, pamCoord in group.iteritems():
334 | if tssRow['strand'] == '+':
335 | sgDistanceSeries.append((sgId, name, tssRow.name,
336 | pamCoord - tssRow['primary TSS'][0],
337 | pamCoord - tssRow['primary TSS'][1],
338 | pamCoord - tssRow['secondary TSS'][0],
339 | pamCoord - tssRow['secondary TSS'][1]))
340 | else:
341 | sgDistanceSeries.append((sgId, name, tssRow.name,
342 | (pamCoord - tssRow['primary TSS'][1]) * -1,
343 | (pamCoord - tssRow['primary TSS'][0]) * -1,
344 | (pamCoord - tssRow['secondary TSS'][1]) * -1,
345 | (pamCoord - tssRow['secondary TSS'][0]) * -1))
346 |
347 | else:
348 | for sgId, pamCoord in group.iteritems():
349 | closestTssRow = tssRow.loc[tssRow.apply(lambda row: abs(pamCoord - row['primary TSS'][0]), axis=1).idxmin()]
350 |
351 | if closestTssRow['strand'] == '+':
352 | sgDistanceSeries.append((sgId, name, closestTssRow.name,
353 | pamCoord - closestTssRow['primary TSS'][0],
354 | pamCoord - closestTssRow['primary TSS'][1],
355 | pamCoord - closestTssRow['secondary TSS'][0],
356 | pamCoord - closestTssRow['secondary TSS'][1]))
357 | else:
358 | sgDistanceSeries.append((sgId, name, closestTssRow.name,
359 | (pamCoord - closestTssRow['primary TSS'][1]) * -1,
360 | (pamCoord - closestTssRow['primary TSS'][0]) * -1,
361 | (pamCoord - closestTssRow['secondary TSS'][1]) * -1,
362 | (pamCoord - closestTssRow['secondary TSS'][0]) * -1))
363 | else:
364 | for name, group in sgInfoTable['pam coordinate'].groupby([libraryTable['gene'],libraryTable['transcripts']]):
365 | if name in p1p2Table.index:
366 | tssRow = p1p2Table.loc[[name]]
367 |
368 | if len(tssRow) == 1:
369 | tssRow = tssRow.iloc[0]
370 | for sgId, pamCoord in group.iteritems():
371 | if tssRow['strand'] == '+':
372 | sgDistanceSeries.append((sgId, tssRow.name[0], tssRow.name[1],
373 | pamCoord - tssRow['primary TSS'][0],
374 | pamCoord - tssRow['primary TSS'][1],
375 | pamCoord - tssRow['secondary TSS'][0],
376 | pamCoord - tssRow['secondary TSS'][1]))
377 | else:
378 | sgDistanceSeries.append((sgId, tssRow.name[0], tssRow.name[1],
379 | (pamCoord - tssRow['primary TSS'][1]) * -1,
380 | (pamCoord - tssRow['primary TSS'][0]) * -1,
381 | (pamCoord - tssRow['secondary TSS'][1]) * -1,
382 | (pamCoord - tssRow['secondary TSS'][0]) * -1))
383 |
384 | else:
385 | print name, tssRow
386 | raise ValueError('all gene/trans pairs should be unique')
387 |
388 | return pd.DataFrame(sgDistanceSeries, columns=['sgId', 'gene', 'transcript', 'primary TSS-Up', 'primary TSS-Down', 'secondary TSS-Up', 'secondary TSS-Down']).set_index(keys=['sgId'])
389 |
390 | def generateSgrnaDistanceTable(sgInfoTable, tssTable, libraryTable):
391 | sgDistanceSeries = []
392 |
393 | for name, group in sgInfoTable['pam coordinate'].groupby([libraryTable['gene'],libraryTable['transcripts']]):
394 | if name in tssTable.index:
395 | tssRow = tssTable.loc[name]
396 | if len(tssRow['cage peak ranges']) != 0:
397 | spotList = []
398 | for rangeTup in tssRow['cage peak ranges']:
399 | spotList.append((rangeTup[0] - tssRow['position']) * (-1 if tssRow['strand'] == '-' else 1))
400 | spotList.append((rangeTup[1] - tssRow['position']) * (-1 if tssRow['strand'] == '-' else 1))
401 |
402 | sgDistanceSeries.append(group.apply(lambda row: distanceMetrics(row, tssRow['position'], min(spotList),max(spotList),tssRow['strand'])))
403 |
404 | else:
405 | sgDistanceSeries.append(group.apply(lambda row: distanceMetrics(row, tssRow['position'], 0, 0, tssRow['strand'])))
406 |
407 | return pd.concat(sgDistanceSeries)
408 |
409 | def distanceMetrics(position, annotatedTss, cageUp, cageDown, strand):
410 | relativePos = (position - annotatedTss) * (1 if strand == '+' else -1)
411 |
412 | return pd.Series((relativePos, relativePos-cageUp, relativePos-cageDown), index=('annotated','cageUp','cageDown'))
413 |
414 | def generateSgrnaLengthSeries(libraryTable):
415 | lengthSeries = libraryTable.apply(lambda row: len(row['sequence']),axis=1)
416 | lengthSeries.name = 'length'
417 | return lengthSeries
418 |
419 | def generateRelativeBasesAndStrand(sgInfoTable, tssTable, libraryTable, genomeDict):
420 | relbases = []
421 | strands = []
422 | sgIds = []
423 | for gene, sgInfoGroup in sgInfoTable.groupby(libraryTable['gene']):
424 | tssRowGroup = tssTable.loc[gene]
425 |
426 | if len(set(tssRowGroup['chromosome'].values)) == 1:
427 | chrom = tssRowGroup['chromosome'].values[0]
428 | else:
429 |             raise ValueError('multiple annotated chromosomes for ' + gene)
430 |
431 | if len(set(tssRowGroup['strand'].values)) == 1:
432 | strand = tssRowGroup['strand'].values[0]
433 | else:
434 |             raise ValueError('multiple annotated strands for ' + gene)
435 |
436 | for sg, sgInfo in sgInfoGroup.iterrows():
437 | sgIds.append(sg)
438 | geneTup = (sgInfo['gene_name'],','.join(sgInfo['transcript_list']))
439 | strands.append(True if sgInfo['strand'] == strand else False)
440 |
441 | baseMatrix = []
442 | for pos in np.arange(-30,10):
443 | baseMatrix.append(getBaseRelativeToPam(chrom, sgInfo['pam coordinate'],sgInfo['length'], sgInfo['strand'], pos, genomeDict))
444 | relbases.append(baseMatrix)
445 |
446 | relbases = pd.DataFrame(relbases, index = sgIds, columns = np.arange(-30,10)).loc[libraryTable.index]
447 | strands = pd.DataFrame(strands, index = sgIds, columns = ['same strand']).loc[libraryTable.index]
448 |
449 | return relbases, strands
450 |
451 | def generateBooleanBaseTable(baseTable):
452 | relbases_bool = []
453 | for base in ['A','G','C','T']:
454 | relbases_bool.append(baseTable.applymap(lambda val: val == base))
455 |
456 | return pd.concat(relbases_bool, keys=['A','G','C','T'], axis=1)
457 |
458 | def generateBooleanDoubleBaseTable(baseTable):
459 | doubleBaseTable = []
460 | tableCols = []
461 | for b1 in ['A','G','C','T']:
462 | for b2 in ['A','G','C','T']:
463 | for i in np.arange(-30,8):
464 | doubleBaseTable.append(pd.concat((baseTable[i] == b1, baseTable[i+1] == b2),axis=1).all(axis=1))
465 | tableCols.append(((b1,b2),i))
466 | return pd.concat(doubleBaseTable, keys=tableCols, axis=1)
467 |
468 | def getBaseRelativeToPam(chrom, pamPos, length, strand, relPos, genomeDict):
469 | rc = {'A':'T','T':'A','G':'C','C':'G','N':'N'}
470 | #print chrom, pamPos, relPos
471 | if strand == '+':
472 | return rc[genomeDict[chrom][pamPos - relPos].upper()]
473 | elif strand == '-':
474 | return genomeDict[chrom][pamPos + relPos].upper()
475 | else:
476 | raise ValueError()
477 |
478 | def getMaxLengthHomopolymer(sequence, base):
479 | sequence = sequence.upper()
480 | base = base.upper()
481 |
482 | maxBaseCount = 0
483 | curBaseCount = 0
484 | for b in sequence:
485 | if b == base:
486 | curBaseCount += 1
487 | else:
488 | maxBaseCount = max((curBaseCount, maxBaseCount))
489 | curBaseCount = 0
490 |
491 | return max((curBaseCount, maxBaseCount))
492 |
493 | def getFractionBaseList(sequence, baseList):
494 | baseSet = [base.upper() for base in baseList]
495 | counter = 0.0
496 | for b in sequence.upper():
497 | if b in baseSet:
498 | counter += 1.0
499 |
500 | return counter / len(sequence)
501 |
502 | #need to fix file naming
503 | def getRNAfoldingTable(libraryTable):
504 | tempfile_fa = tempfile.NamedTemporaryFile('w+t', delete=False)
505 | tempfile_rnafold = tempfile.NamedTemporaryFile('w+t', delete=False)
506 |
507 | for name, row in libraryTable.iterrows():
508 | tempfile_fa.write('>' + name + '\n' + row['sequence'] + '\n')
509 |
510 | tempfile_fa.close()
511 | tempfile_rnafold.close()
512 | # print tempfile_fa.name, tempfile_rnafold.name
513 |
514 | subprocess.call('RNAfold --noPS < %s > %s' % (tempfile_fa.name, tempfile_rnafold.name), shell=True)
515 |
516 | mfeSeries_noScaffold = parseViennaMFE(tempfile_rnafold.name, libraryTable)
517 | isPaired = parseViennaPairing(tempfile_rnafold.name, libraryTable)
518 |
519 | tempfile_fa = tempfile.NamedTemporaryFile('w+t', delete=False)
520 | tempfile_rnafold = tempfile.NamedTemporaryFile('w+t', delete=False)
521 |
522 | with open(tempfile_fa.name,'w') as outfile:
523 | for name, row in libraryTable.iterrows():
524 | outfile.write('>' + name + '\n' + row['sequence'] + 'GTTTAAGAGCTAAGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT\n')
525 |
526 | tempfile_fa.close()
527 | tempfile_rnafold.close()
528 | # print tempfile_fa.name, tempfile_rnafold.name
529 |
530 | subprocess.call('RNAfold --noPS < %s > %s' % (tempfile_fa.name, tempfile_rnafold.name), shell=True)
531 |
532 | mfeSeries_wScaffold = parseViennaMFE(tempfile_rnafold.name, libraryTable)
533 |
534 | return pd.concat((mfeSeries_noScaffold, mfeSeries_wScaffold, isPaired), keys=('no scaffold', 'with scaffold', 'is Paired'), axis=1)
535 |
536 | def parseViennaMFE(viennaOutputFile, libraryTable):
537 | mfeList = []
538 | with open(viennaOutputFile) as infile:
539 | for i, line in enumerate(infile):
540 | if i%3 == 2:
541 | mfeList.append(float(line.strip().strip('.() ')))
542 | return pd.Series(mfeList, index=libraryTable.index, name='RNA minimum free energy')
543 |
544 | def parseViennaPairing(viennaOutputFile, libraryTable):
545 | paired = []
546 | with open(viennaOutputFile) as infile:
547 | for i, line in enumerate(infile):
548 | if i%3 == 2:
549 | foldString = line.strip().split(' ')[0]
550 | paired.append([char != '.' for char in foldString[-18:]])
551 | return pd.DataFrame(paired, index=libraryTable.index, columns = range(-20,-2))
552 |
553 | def getChromatinDataSeries(bigwigFile, libraryTable, sgInfoTable, tssTable, colname = '', naValue = 0):
554 | bwindex = BigWigFile(open(bigwigFile))
555 | chromDict = tssTable['chromosome'].to_dict()
556 |
557 | chromatinScores = []
558 | for name, sgInfo in sgInfoTable.iterrows():
559 | geneTup = (sgInfo['gene_name'],','.join(sgInfo['transcript_list']))
560 |
561 | if geneTup not in chromDict: #negative controls
562 | chromatinScores.append(np.nan)
563 | continue
564 |
565 | if sgInfo['strand'] == '+':
566 | sgRange = sgInfo['pam coordinate'] + sgInfo['length']
567 | else:
568 | sgRange = sgInfo['pam coordinate'] - sgInfo['length']
569 |
570 | chrom = chromDict[geneTup]
571 |
572 | chromatinArray = bwindex.get_as_array(chrom, min(sgInfo['pam coordinate'], sgRange), max(sgInfo['pam coordinate'], sgRange))
573 | if chromatinArray is not None and len(chromatinArray) > 0:
574 | chromatinScores.append(np.nanmean(chromatinArray))
575 | else: #often chrY when using K562 data..
576 | # print name
577 | # print chrom, min(sgInfo['pam coordinate'], sgRange), max(sgInfo['pam coordinate'], sgRange)
578 | chromatinScores.append(np.nan)
579 |
580 | chromatinSeries = pd.Series(chromatinScores, index=libraryTable.index, name = colname)
581 |
582 | return chromatinSeries.fillna(naValue)
583 |
584 | def getChromatinDataSeriesByGene(bigwigFileHandle, libraryTable, sgInfoTable, p1p2Table, sgrnaDistanceTable_p1p2, colname = '', naValue = 0, normWindow = 1000):
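    #like getChromatinDataSeries, but the mean signal over each sgRNA footprint is
    #normalized to the maximum signal within +/- normWindow bp of the gene's primary
    #TSS, making accessibility values comparable across loci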
585 | bwindex = bigwigFileHandle #BigWigFile(open(bigwigFile))
586 |
587 | chromatinScores = []
588 | for (gene, transcript), sgInfoGroup in sgInfoTable.groupby([sgrnaDistanceTable_p1p2['gene'], sgrnaDistanceTable_p1p2['transcript']]):
589 | tssRow = p1p2Table.loc[[(gene, transcript)]].iloc[0,:]
590 |
591 | chrom = tssRow['chromosome']
592 |
593 | normWindowArray = bwindex.get_as_array(chrom, max(0, tssRow['primary TSS'][0] - normWindow), tssRow['primary TSS'][0] + normWindow)
594 | if normWindowArray is not None:
595 | normFactor = np.nanmax(normWindowArray)
596 | else:
597 | normFactor = 1
598 |
599 | windowMin = max(0, min(sgInfoGroup['pam coordinate']) - max(sgInfoGroup['length']) - 10)
600 | windowMax = max(sgInfoGroup['pam coordinate']) + max(sgInfoGroup['length']) + 10
601 | chromatinWindow = bwindex.get_as_array(chrom, windowMin, windowMax)
602 |
603 | chromatinScores.append(sgInfoGroup.apply(lambda row: getChromatinData(row, chromatinWindow, windowMin, normFactor), axis=1))
604 |
605 |
606 | chromatinSeries = pd.concat(chromatinScores)
607 |
608 | return chromatinSeries.fillna(naValue)
609 |
610 | def getChromatinData(sgInfoRow, chromatinWindowArray, windowMin, normFactor):
611 | if sgInfoRow['strand'] == '+':
612 | sgRange = sgInfoRow['pam coordinate'] + sgInfoRow['length']
613 | else:
614 | sgRange = sgInfoRow['pam coordinate'] - sgInfoRow['length']
615 |
616 |
617 | if chromatinWindowArray is not None:# and len(chromatinWindowArray) > 0:
618 | chromatinArray = chromatinWindowArray[min(sgInfoRow['pam coordinate'], sgRange) - windowMin: max(sgInfoRow['pam coordinate'], sgRange) - windowMin]
619 | return np.nanmean(chromatinArray)/normFactor
620 | else: #often chrY when using K562 data..
621 | # print name
622 | # print chrom, min(sgInfo['pam coordinate'], sgRange), max(sgInfo['pam coordinate'], sgRange)
623 | return np.nan
624 |
625 | def generateTypicalParamTable(libraryTable, sgInfoTable, tssTable, p1p2Table, genomeDict, bwFileHandleDict, transcripts=False):
626 | lengthSeries = generateSgrnaLengthSeries(libraryTable)
627 |
628 | # sgrnaPositionTable = generateSgrnaDistanceTable(sgInfoTable, tssTable, libraryTable)
629 | sgrnaPositionTable_p1p2 = generateSgrnaDistanceTable_p1p2Strategy(sgInfoTable, libraryTable, p1p2Table, transcripts)
630 |
631 | baseTable, strand = generateRelativeBasesAndStrand(sgInfoTable, tssTable, libraryTable, genomeDict)
632 | booleanBaseTable = generateBooleanBaseTable(baseTable)
633 | doubleBaseTable = generateBooleanDoubleBaseTable(baseTable)
634 |
635 | printNow('.')
636 | baseList = ['A','G','C','T']
637 | homopolymerTable = pd.concat([libraryTable.apply(lambda row: np.floor(getMaxLengthHomopolymer(row['sequence'], base)), axis=1) for base in baseList],keys=baseList,axis=1)
638 |
639 | baseFractions = pd.concat([libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['A']),axis=1),
640 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['G']),axis=1),
641 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['C']),axis=1),
642 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['T']),axis=1),
643 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['G','C']),axis=1),
644 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['G','A']),axis=1),
645 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['C','A']),axis=1)],keys=['A','G','C','T','GC','purine','CA'],axis=1)
646 |
647 | printNow('.')
648 |
649 | dnaseSeries = getChromatinDataSeriesByGene(bwFileHandleDict['dnase'], libraryTable, sgInfoTable, p1p2Table, sgrnaPositionTable_p1p2)
650 | printNow('.')
651 | faireSeries = getChromatinDataSeriesByGene(bwFileHandleDict['faire'], libraryTable, sgInfoTable, p1p2Table, sgrnaPositionTable_p1p2)
652 | printNow('.')
653 | mnaseSeries = getChromatinDataSeriesByGene(bwFileHandleDict['mnase'], libraryTable, sgInfoTable, p1p2Table, sgrnaPositionTable_p1p2)
654 | printNow('.')
655 |
656 | rnafolding = getRNAfoldingTable(libraryTable)
657 |
658 | printNow('Done!')
659 |
660 | return pd.concat([lengthSeries,
661 | sgrnaPositionTable_p1p2.iloc[:,2:],
662 | homopolymerTable,
663 | baseFractions,
664 | strand,
665 | booleanBaseTable['A'],
666 | booleanBaseTable['T'],
667 | booleanBaseTable['G'],
668 | booleanBaseTable['C'],
669 | doubleBaseTable,
670 | pd.concat([dnaseSeries,faireSeries,mnaseSeries],keys=['DNase','FAIRE','MNase'], axis=1),
671 | rnafolding['no scaffold'],
672 | rnafolding['with scaffold'],
673 | rnafolding['is Paired']],keys=['length',
674 | 'distance',
675 | 'homopolymers',
676 | 'base fractions',
677 | 'strand',
678 | 'base table-A',
679 | 'base table-T',
680 | 'base table-G',
681 | 'base table-C',
682 | 'base dimers',
683 | 'accessibility',
684 | 'RNA folding-no scaffold',
685 | 'RNA folding-with scaffold',
686 | 'RNA folding-pairing, no scaffold'],axis=1)
687 |
688 | # def generateTypicalParamTable_parallel(libraryTable, sgInfoTable, tssTable, p1p2Table, genomeDict, bwFileHandleDict, processors):
689 | # processPool = multiprocessing.Pool(processors)
690 |
691 | # colTupList = zip([group for gene, group in libraryTable.groupby(libraryTable['gene'])],
692 | # [group for gene, group in sgInfoTable.groupby(libraryTable['gene'])])
693 |
694 | # result = processPool.map(lambda colTup: generateTypicalParamTable(colTup[0], colTup[1], tssTable, p1p2Table, genomeDict,bwFileHandleDict), colTupList)
695 |
696 | # return pd.concat(result)
697 |
698 | ###############################################################################
699 | # Learn Parameter Weights #
700 | ###############################################################################
701 | def fitParams(paramTable, scoreTable, fitTable):
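    #fitTable is expected to hold one row per parameter column, with a 'type' field
    #('binary', 'continuous', 'binnable', or 'binnable_onehot') and a 'params' field;
    #binary parameters pass through unchanged, continuous parameters are fit by an SVR
    #grid search against the activity scores, and binnable parameters are summarized by
    #a per-bin statistic; the fitted estimators are returned for reuse by transformParams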
702 | predictedParams = []
703 | estimators = []
704 |
705 | for i, (name, col) in enumerate(paramTable.iteritems()):
706 |
707 | fitRow = fitTable.iloc[i]
708 |
709 | if fitRow['type'] == 'binary': #binary parameter
710 | # print name, 'is binary parameter'
711 | predictedParams.append(col)
712 | estimators.append('binary')
713 |
714 | elif fitRow['type'] == 'continuous':
715 | col_reshape = col.values.reshape(len(col),1)
716 | parameters = fitRow['params']
717 |
718 | svr = svm.SVR(cache_size=500)
719 | clf = grid_search.GridSearchCV(svr, parameters, n_jobs=16, verbose=0)
720 | clf.fit(col_reshape, scoreTable)
721 |
722 | print name, clf.best_params_
723 | predictedParams.append(pd.Series(clf.predict(col_reshape), index=col.index, name=name))
724 | estimators.append(clf.best_estimator_)
725 |
726 | elif fitRow['type'] == 'binnable':
727 | parameters = fitRow['params']
728 |
729 | assignedBins = binValues(col, parameters['bin width'], parameters['min edge data'])
730 | groupStats = scoreTable.groupby(assignedBins).agg(parameters['bin function'])
731 |
732 | # print name
733 | # print pd.concat((groupStats,scoreTable.groupby(assignedBins).size()), axis=1)
734 |
735 | binnedScores = assignedBins.apply(lambda binVal: groupStats.loc[binVal])
736 |
737 | predictedParams.append(binnedScores)
738 | estimators.append(groupStats)
739 |
740 | elif fitRow['type'] == 'binnable_onehot':
741 | parameters = fitRow['params']
742 |
743 | assignedBins = binValues(col, parameters['bin width'], parameters['min edge data'])
744 | binGroups = scoreTable.groupby(assignedBins)
745 | groupStats = binGroups.agg(parameters['bin function'])
746 |
747 | # print name
748 | # print pd.concat((groupStats,scoreTable.groupby(assignedBins).size()), axis=1)
749 |
750 | oneHotFrame = pd.DataFrame(np.zeros((len(assignedBins),len(binGroups))), index = assignedBins.index, \
751 | columns=pd.MultiIndex.from_tuples([(name[0],', '.join([name[1],key])) for key in sorted(binGroups.groups.keys())]))
752 |
753 | for groupName, group in binGroups:
754 | oneHotFrame.loc[group.index, (name[0],', '.join([name[1],groupName]))] = 1
755 |
756 | predictedParams.append(oneHotFrame)
757 | estimators.append(groupStats)
758 |
759 | else:
760 |         raise ValueError(fitRow['type'] + ' not implemented')
761 |
762 | return pd.concat(predictedParams, axis=1), estimators
763 |
764 | def binValues(col, binsize, minEdgePoints=0, edgeOffset = None):
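    #bin a continuous column into fixed-width bins; when minEdgePoints > 0, sparse bins
    #at either extreme are merged into open-ended '< x' / '>= x' edge bins (or shifted
    #numerically by edgeOffset instead, which eases plotting)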
765 | bins = np.floor(col / binsize) * binsize
766 |
767 | if minEdgePoints <= 0:
768 |         if edgeOffset is None:
769 | return bins.apply(lambda binVal: str(binVal))
770 | else:
771 | return bins
772 | elif minEdgePoints >= len(col):
773 | raise ValueError('too few data points to meet minimum edge requirements')
774 | else:
775 | binGroups = bins.groupby(bins)
776 | binCounts = binGroups.agg(len).sort_index()
777 |
778 | i = 0
779 | leftBin = []
780 | if binCounts.iloc[i] < minEdgePoints:
781 | leftCount = 0
782 | while leftCount < minEdgePoints:
783 | leftCount += binCounts.iloc[i]
784 | leftBin.append(binCounts.index[i])
785 | i += 1
786 |
787 | leftLessThan = binCounts.index[i]
788 |
789 | j = -1
790 | rightBin = []
791 | if binCounts.iloc[j] < minEdgePoints:
792 | rightCount = 0
793 | while rightCount < minEdgePoints:
794 | rightBin.append(binCounts.index[j])
795 | rightCount += binCounts.iloc[j]
796 | j -= 1
797 |
798 | rightMoreThan = binCounts.index[j + 1]
799 |
800 | if i > len(binCounts) + j:
801 | raise ValueError('min edge requirements cannot be met')
802 |
803 |         if edgeOffset is None: #return strings for bins, fine for grouping, problems for plotting
804 | return bins.apply(lambda binVal: '< %f' % leftLessThan if binVal in leftBin else('>= %f' % rightMoreThan if binVal in rightBin else str(binVal)))
805 | else: #apply arbitrary offset instead to ease plotting
806 | return bins.apply(lambda binVal: leftLessThan - edgeOffset if binVal in leftBin else(rightMoreThan + edgeOffset if binVal in rightBin else binVal))
807 |
808 | def transformParams(paramTable, fitTable, estimators):
809 | transformedParams = []
810 |
811 | for i, (name, col) in enumerate(paramTable.iteritems()):
812 | fitRow = fitTable.iloc[i]
813 |
814 | if fitRow['type'] == 'binary':
815 | transformedParams.append(col)
816 | elif fitRow['type'] == 'continuous':
817 | col_reshape = col.values.reshape(len(col),1)
818 | transformedParams.append(pd.Series(estimators[i].predict(col_reshape), index=col.index, name=name))
819 | elif fitRow['type'] == 'binnable':
820 | binStats = estimators[i]
821 | assignedBins = applyBins(col, binStats.index.values)
822 | transformedParams.append(assignedBins.apply(lambda binVal: binStats.loc[binVal]))
823 |
824 | elif fitRow['type'] == 'binnable_onehot':
825 | binStats = estimators[i]
826 |
827 | assignedBins = applyBins(col, binStats.index.values)
828 | binGroups = col.groupby(assignedBins)
829 |
830 | # print name
831 | # print pd.concat((groupStats,scoreTable.groupby(assignedBins).size()), axis=1)
832 |
833 | oneHotFrame = pd.DataFrame(np.zeros((len(assignedBins),len(binGroups))), index = assignedBins.index, \
834 | columns=pd.MultiIndex.from_tuples([(name[0],', '.join([name[1],key])) for key in sorted(binGroups.groups.keys())]))
835 |
836 | for groupName, group in binGroups:
837 | oneHotFrame.loc[group.index, (name[0],', '.join([name[1],groupName]))] = 1
838 |
839 | transformedParams.append(oneHotFrame)
840 |
841 | return pd.concat(transformedParams, axis=1)
842 |
843 | def applyBins(column, binStrings):
844 | leftLabel = ''
845 | rightLabel = ''
846 | binTups = []
847 | for binVal in binStrings:
848 | if binVal[0] == '<':
849 | leftLabel = binVal
850 | elif binVal[0] == '>':
851 | rightLabel = binVal
852 | rightBound = float(binVal[3:])
853 | else:
854 | binTups.append((float(binVal),binVal))
855 |
856 | binTups.sort()
857 | # print binTups
858 | leftBound = binTups[0][0]
859 | if leftLabel == '':
860 | leftLabel = binTups[0][1]
861 |
862 | if rightLabel == '':
863 | rightLabel = binTups[-1][1]
864 | rightBound = binTups[-1][0]
865 |
866 | def binFunc(val):
867 | return leftLabel if val < leftBound else (rightLabel if val >= rightBound else [tup[1] for tup in binTups if val >= tup[0]][-1])
868 |
869 | return column.apply(binFunc)
870 |
871 | ###############################################################################
872 | #                 Predict sgRNA Scores and Construct Library                 #
873 | ###############################################################################
874 | def findAllGuides(p1p2Table, genomeDict, rangeTup, sgRNALength=20):
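    #scan the window given by rangeTup around each primary/secondary TSS for PAM sites
    #on both genomic strands (detected as 'CC' or 'GG' dinucleotides on the top strand)
    #and record each candidate protospacer of sgRNALength nt with its 5' base forced to G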
875 | newLibraryTable = []
876 | newSgInfoTable = []
877 |
878 | for tssTup, tssRow in p1p2Table.iterrows():
879 | rangeStart = min(min(tssRow['primary TSS']), min(tssRow['secondary TSS'])) + (rangeTup[0] if tssRow['strand'] == '+' else -1 * rangeTup[1])
880 | rangeEnd = max(max(tssRow['primary TSS']), max(tssRow['secondary TSS'])) + (rangeTup[1] if tssRow['strand'] == '+' else -1 * rangeTup[0])
881 |
882 | genomeRange = str(genomeDict[tssRow['chromosome']][rangeStart:rangeEnd + 1].seq)
883 |
884 | rangeLength = rangeEnd + 1 - rangeStart
885 | for posOffset in range(rangeLength):
886 | if genomeRange[posOffset:posOffset+1+1].upper() == 'CC' \
887 | and posOffset + 3 + sgRNALength < rangeLength \
888 | and 'N' not in genomeRange[posOffset:posOffset+3+sgRNALength].upper():
889 | pamCoord = rangeStart+posOffset
890 | sgId = tssTup[0] + '_' + '+' + '_' + str(pamCoord) + '.' + str(sgRNALength + 3) + '-' + tssTup[1]
891 | gene = tssTup[0]
892 | transcripts = tssTup[1]
893 | rawSequence = genomeRange[posOffset:posOffset+3+sgRNALength]
894 | sequence = 'G' + str(Seq.Seq(rawSequence[3:-1]).reverse_complement())
895 |
896 | newLibraryTable.append((sgId, gene, transcripts, sequence, rawSequence))
897 | newSgInfoTable.append((sgId, 'None', gene, sgRNALength + 3, pamCoord, 'not assigned', pamCoord, '+', tssTup[1].split(',')))
898 |
899 | elif genomeRange[posOffset-1:posOffset+1].upper() == 'GG' \
900 | and posOffset - 3 - sgRNALength >= 0 \
901 | and 'N' not in genomeRange[posOffset + 1 - 3 - sgRNALength:posOffset+1].upper():
902 | pamCoord = rangeStart+posOffset
903 | sgId = tssTup[0] + '_' + '-' + '_' + str(pamCoord) + '.' + str(sgRNALength + 3) + '-' + tssTup[1]
904 | gene = tssTup[0]
905 | transcripts = tssTup[1]
906 | rawSequence = genomeRange[posOffset + 1 - 3 - sgRNALength:posOffset+1]
907 | sequence = 'G' + rawSequence[1:-3]
908 |
909 | newLibraryTable.append((sgId, gene, transcripts, sequence, rawSequence))
910 | newSgInfoTable.append((sgId, 'None', gene, sgRNALength + 3, pamCoord, 'not assigned', pamCoord, '-', tssTup[1].split(',')))
911 |
912 | return pd.DataFrame(newLibraryTable,columns=['sgId','gene','transcripts','sequence', 'genomic sequence']).set_index('sgId'), \
913 | pd.DataFrame(newSgInfoTable, columns=['sgId','Sublibrary','gene_name', 'length', 'pam coordinate','pass_score','position','strand', 'transcript_list']).set_index('sgId')
914 |
915 | ###############################################################################
916 | # Utility Functions #
917 | ###############################################################################
918 | def getPseudoIndices(table):
919 | return table.apply(lambda row: row.name[0][:6] == 'pseudo', axis=1)
920 |
921 | def loadGencodeData(gencodeGTF, indexByENSG = True):
922 | printNow('Loading annotation file...')
923 | gencodeData = dict()
924 | with open(gencodeGTF) as gencodeFile:
925 | for line in gencodeFile:
926 | if line[0] != '#':
927 | linesplit = line.strip().split('\t')
928 | attrsplit = linesplit[-1].strip('; ').split('; ')
929 | attrdict = {attr.split(' ')[0]:attr.split(' ')[1].strip('\"') for attr in attrsplit if attr[:3] !='tag'}
930 | attrdict['tags'] = [attr.split(' ')[1].strip('\"') for attr in attrsplit if attr[:3] == 'tag']
931 |
932 | if indexByENSG:
933 | dictKey = attrdict['gene_id'].split('.')[0]
934 | else:
935 | dictKey = attrdict['gene_name']
936 |
937 | #catch y-linked pseudoautosomal genes
938 | if 'PAR' in attrdict['tags'] and linesplit[0] == 'chrY':
939 | continue
940 |
941 | if linesplit[2] == 'gene':# and attrdict['gene_type'] == 'protein_coding':
942 | gencodeData[dictKey] = ([linesplit[0],long(linesplit[3]),long(linesplit[4]),linesplit[6], attrdict],[])
943 | elif linesplit[2] == 'transcript':
944 | gencodeData[dictKey][1].append([linesplit[0],long(linesplit[3]),long(linesplit[4]),linesplit[6], attrdict])
945 |
946 | printNow('Done\n')
947 |
948 | return gencodeData
949 |
950 | def loadGenomeAsDict(genomeFasta):
951 | printNow('Loading genome file...')
952 | genomeDict = SeqIO.to_dict(SeqIO.parse(genomeFasta,'fasta'))
953 | printNow('Done\n')
954 | return genomeDict
955 |
956 | def loadCageBedData(cageBedFile, matchList = ['p1','p2']):
957 | cageBedDict = {match:dict() for match in matchList}
958 |
959 | with open(cageBedFile) as infile:
960 | for line in infile:
961 | linesplit = line.strip().split('\t')
962 |
963 | for name in linesplit[3].split(','):
964 | namesplit = name.split('@')
965 | if len(namesplit) == 2:
966 | for match in matchList:
967 | if namesplit[0] == match:
968 | cageBedDict[match][namesplit[1]] = linesplit
969 |
970 | return cageBedDict
971 |
972 | def matchPeakName(peakName, geneAliasList, promoterRank):
973 | for peakString in peakName.split(','):
974 | peakSplit = peakString.split('@')
975 |
976 | if len(peakSplit) == 2\
977 | and peakSplit[0] == promoterRank\
978 | and peakSplit[1] in geneAliasList:
979 | return True
980 |
981 |         if len(peakSplit) > 2: #flag unexpected peak name formats for manual review
982 |             print 'unexpected peak name format:', peakName
983 |
984 | return False
985 |
986 | def generateAliasDict(hgncFile, gencodeData):
987 | hgncTable = pd.read_csv(hgncFile,sep='\t', header=0).fillna('')
988 |
989 | geneToAliases = dict()
990 | geneToENSG = dict()
991 |
992 | for i, row in hgncTable.iterrows():
993 | geneToAliases[row['Approved Symbol']] = [row['Approved Symbol']]
994 | geneToAliases[row['Approved Symbol']].extend([] if len(row['Previous Symbols']) == 0 else [name.strip() for name in row['Previous Symbols'].split(',')])
995 | geneToAliases[row['Approved Symbol']].extend([] if len(row['Synonyms']) == 0 else [name.strip() for name in row['Synonyms'].split(',')])
996 |
997 | geneToENSG[row['Approved Symbol']] = row['Ensembl Gene ID']
998 |
999 | # for gene in gencodeData:
1000 | # if gene not in geneToAliases:
1001 | # geneToAliases[gene] = [gene]
1002 |
1003 | # geneToAliases[gene].extend([tr[-1]['transcript_id'].split('.')[0] for tr in gencodeData[gene][1]])
1004 |
1005 | return geneToAliases, geneToENSG
1006 |
1007 | #Parse information from the sgRNA ID standard format
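#e.g. parseSgId('Drug_Targets+Kinase_Phosphatase=AARS_+_70323216.24-all~e39m1') returns
# {'Sublibrary': 'Drug_Targets+Kinase_Phosphatase', 'gene_name': 'AARS', 'strand': '+',
#  'position': 70323216, 'length': 24, 'pass_score': 'e39m1', 'transcript_list': ['all']}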
1008 | def parseSgId(sgId):
1009 | parseDict = dict()
1010 |
1011 | #sublibrary
1012 | if len(sgId.split('=')) == 2:
1013 | parseDict['Sublibrary'] = sgId.split('=')[0]
1014 | remainingId = sgId.split('=')[1]
1015 | else:
1016 | parseDict['Sublibrary'] = None
1017 | remainingId = sgId
1018 |
1019 | #gene name and strand
1020 | underscoreSplit = remainingId.split('_')
1021 |
1022 | for i,item in enumerate(underscoreSplit):
1023 | if item == '+':
1024 | strand = '+'
1025 | geneName = '_'.join(underscoreSplit[:i])
1026 | remainingId = '_'.join(underscoreSplit[i+1:])
1027 | break
1028 | elif item == '-':
1029 | strand = '-'
1030 | geneName = '_'.join(underscoreSplit[:i])
1031 | remainingId = '_'.join(underscoreSplit[i+1:])
1032 | break
1033 | else:
1034 | continue
1035 |
1036 | parseDict['strand'] = strand
1037 | parseDict['gene_name'] = geneName
1038 |
1039 | #position
1040 | dotSplit = remainingId.split('.')
1041 | parseDict['position'] = int(dotSplit[0])
1042 | remainingId = '.'.join(dotSplit[1:])
1043 |
1044 | #length incl pam
1045 | dashSplit = remainingId.split('-')
1046 | parseDict['length'] = int(dashSplit[0])
1047 | remainingId = '-'.join(dashSplit[1:])
1048 |
1049 | #pass score
1050 | tildaSplit = remainingId.split('~')
1051 | parseDict['pass_score'] = tildaSplit[-1]
1052 | remainingId = '~'.join(tildaSplit[:-1]) #should always be length 1 anyway
1053 |
1054 | #transcripts
1055 | parseDict['transcript_list'] = remainingId.split(',')
1056 |
1057 | return parseDict
1058 |
1059 | def parseAllSgIds(libraryTable):
1060 | sgInfoList = []
1061 | for sgId, row in libraryTable.iterrows():
1062 | sgInfo = parseSgId(sgId)
1063 |
1064 |         #the parsed position is the PAM coordinate on both strands
1065 |         sgInfo['pam coordinate'] = sgInfo['position']
1069 |
1070 | sgInfoList.append(sgInfo)
1071 |
1072 | return pd.DataFrame(sgInfoList, index=libraryTable.index)
1073 |
1074 | def printNow(outputString):
1075 | sys.stdout.write(outputString)
1076 | sys.stdout.flush()
1077 |
--------------------------------------------------------------------------------
/Library_design_walkthrough.md:
--------------------------------------------------------------------------------
1 |
2 | 1. Learning sgRNA predictors from empirical data
3 | * Load scripts and empirical data
4 |     * Calculate parameters for empirical sgRNAs
5 |     * Fit parameters
6 | 2. Applying machine learning model to predict sgRNA activity
7 |     * Generate TSS annotation using FANTOM dataset
8 |     * Find all sgRNAs in genomic regions of interest
9 | * Predicting sgRNA activity
10 | 3. Construct sgRNA libraries
11 | * Score sgRNAs for off-target potential
12 | * Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering
13 | * Design negative controls matching the base composition of the library
14 | * Finalizing library design
15 |
16 | # 1. Learning sgRNA predictors from empirical data
17 | ## Load scripts and empirical data
18 |
19 |
20 | ```python
21 | import sys
22 | sys.path.insert(0, '../ScreenProcessing/')
23 | %run sgRNA_learning.py
24 | ```
25 |
26 |
27 | ```python
28 | genomeDict = loadGenomeAsDict('large_data_files/hg19.fa')
29 | ```
30 |
31 | Loading genome file...Done
32 |
33 |
34 |
35 | ```python
36 | #to use pre-calculated sgRNA activity score data (e.g. provided CRISPRi training data), load the following:
37 | #CRISPRa activity score data also included in data_files
38 | libraryTable_training = pd.read_csv('data_files/CRISPRi_trainingdata_libraryTable.txt', sep='\t', index_col = 0)
39 | libraryTable_training.head()
40 | ```
41 |
42 |
43 |
44 |
45 |
46 |
47 |
48 |
49 | | sgId | sublibrary | gene | transcripts | sequence |
50 | |---|---|---|---|---|
51 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323216.24-all~e39m1 | drug_targets+kinase_phosphatase | AARS | all | GCCCCAGGATCAGGCCCCGCG |
52 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323296.24-all~e39m1 | drug_targets+kinase_phosphatase | AARS | all | GGCCGCCCTCGGAGAGCTCTG |
53 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323318.24-all~e39m1 | drug_targets+kinase_phosphatase | AARS | all | GACGGCGACCCTAGGAGAGGT |
54 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323362.24-all~e39m1 | drug_targets+kinase_phosphatase | AARS | all | GGTGCAGCGGGCCCTTGGCGG |
55 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323441.24-all~e39m1 | drug_targets+kinase_phosphatase | AARS | all | GCGCTCTGATTGGACGGAGCG |
98 |
99 |
100 |
101 |
102 |
103 |
104 |
105 |
106 | ```python
107 | sgInfoTable_training = pd.read_csv('data_files/CRISPRi_trainingdata_sgRNAInfoTable.txt', sep='\t', index_col=0)
108 | sgInfoTable_training.head()
109 | ```
110 |
111 |
112 |
113 |
114 |
115 |
116 |
117 |
118 | | sgId | Sublibrary | gene_name | length | pam coordinate | pass_score | position | strand | transcript_list |
119 | |---|---|---|---|---|---|---|---|---|
120 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323216.24-all~e39m1 | Drug_Targets+Kinase_Phosphatase | AARS | 24 | 70323216 | e39m1 | 70323216 | + | ['all'] |
121 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323296.24-all~e39m1 | Drug_Targets+Kinase_Phosphatase | AARS | 24 | 70323296 | e39m1 | 70323296 | + | ['all'] |
122 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323318.24-all~e39m1 | Drug_Targets+Kinase_Phosphatase | AARS | 24 | 70323318 | e39m1 | 70323318 | + | ['all'] |
123 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323362.24-all~e39m1 | Drug_Targets+Kinase_Phosphatase | AARS | 24 | 70323362 | e39m1 | 70323362 | + | ['all'] |
124 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323441.24-all~e39m1 | Drug_Targets+Kinase_Phosphatase | AARS | 24 | 70323441 | e39m1 | 70323441 | + | ['all'] |
195 |
196 |
197 |
198 |
199 |
200 |
201 |
202 |
203 | ```python
204 | activityScores = pd.read_csv('data_files/CRISPRi_trainingdata_activityScores.txt',sep='\t',index_col=0, header=None).iloc[:,0]
205 | activityScores.head()
206 | ```
207 |
208 |
209 |
210 |
211 | 0
212 | Drug_Targets+Kinase_Phosphatase=AARS_+_70323216.24-all~e39m1 0.348892
213 | Drug_Targets+Kinase_Phosphatase=AARS_+_70323296.24-all~e39m1 0.912409
214 | Drug_Targets+Kinase_Phosphatase=AARS_+_70323318.24-all~e39m1 0.997242
215 | Drug_Targets+Kinase_Phosphatase=AARS_+_70323362.24-all~e39m1 0.962154
216 | Drug_Targets+Kinase_Phosphatase=AARS_+_70323441.24-all~e39m1 0.019320
217 | Name: 1, dtype: float64
218 |
219 |
220 |
221 |
222 | ```python
223 | tssTable = pd.read_csv('data_files/human_tssTable.txt',sep='\t', index_col=range(2))
224 | tssTable.head()
225 | ```
226 |
227 |
228 |
229 |
230 |
231 |
232 |
233 |
234 | | gene | transcripts | position | strand | chromosome | cage peak ranges |
235 | |---|---|---|---|---|---|
236 | | A1BG | all | 58864864 | - | chr19 | [(58864822, 58864847), (58864848, 58864868)] |
237 | | A1CF | ENST00000373993.1 | 52619744 | - | chr10 | [] |
238 | | A1CF | ENST00000374001.2,ENST00000373997.3,ENST00000282641.2 | 52645434 | - | chr10 | [(52645379, 52645393), (52645416, 52645444)] |
239 | | A2M | all | 9268752 | - | chr12 | [(9268547, 9268556), (9268559, 9268568), (9268... |
240 | | A2ML1 | all | 8975067 | + | chr12 | [(8975061, 8975072), (8975101, 8975108), (8975... |
289 |
290 |
291 |
292 |
293 |
294 |
295 |
296 |
297 | ```python
298 | p1p2Table = pd.read_csv('data_files/human_p1p2Table.txt',sep='\t', header=0, index_col=range(2))
299 | p1p2Table['primary TSS'] = p1p2Table['primary TSS'].apply(lambda tupString: (int(tupString.strip('()').split(', ')[0].split('.')[0]), int(tupString.strip('()').split(', ')[1].split('.')[0])))
300 | p1p2Table['secondary TSS'] = p1p2Table['secondary TSS'].apply(lambda tupString: (int(tupString.strip('()').split(', ')[0].split('.')[0]),int(tupString.strip('()').split(', ')[1].split('.')[0])))
301 | p1p2Table.head()
302 | ```
303 |
304 |
305 |
306 |
307 |
308 |
309 |
310 |
311 | | gene | transcript | chromosome | strand | TSS source | primary TSS | secondary TSS |
312 | |---|---|---|---|---|---|---|
313 | | A1BG | P1 | chr19 | - | CAGE, matched peaks | (58858938, 58859039) | (58858938, 58859039) |
314 | | A1BG | P2 | chr19 | - | CAGE, matched peaks | (58864822, 58864847) | (58864822, 58864847) |
315 | | A1CF | P1P2 | chr10 | - | CAGE, matched peaks | (52645379, 52645393) | (52645379, 52645393) |
316 | | A2M | P1P2 | chr12 | - | CAGE, matched peaks | (9268507, 9268523) | (9268528, 9268542) |
317 | | A2ML1 | P1P2 | chr12 | + | CAGE, matched peaks | (8975206, 8975223) | (8975144, 8975169) |
373 |
374 |
375 |
376 |
377 |
378 |
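The lambdas in the loading cell above convert the stored tuple strings back into integer pairs. A minimal standalone sketch of the same conversion, assuming the ranges were saved in a form like '(58864822.0, 58864847.0)':

```python
#parse a "(start, end)" tuple string back into a tuple of ints,
#tolerating a trailing decimal as in '(58864822.0, 58864847.0)'
def parseTupleString(tupString):
    left, right = tupString.strip('()').split(', ')
    return (int(left.split('.')[0]), int(right.split('.')[0]))

print parseTupleString('(58864822.0, 58864847.0)') #(58864822, 58864847)
```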
379 |
380 | ## Calculate parameters for empirical sgRNAs
381 |
382 | ### Because scikit-learn currently does not support any robust method for saving and re-loading the machine learning model, the best strategy is to simply re-learn the model from the training data
383 |
384 |
385 | ```python
386 | #Load bigwig files for any chromatin data of interest
387 | bwhandleDict = {'dnase':BigWigFile(open('large_data_files/wgEncodeOpenChromDnaseK562BaseOverlapSignalV2.bigWig')),
388 | 'faire':BigWigFile(open('large_data_files/wgEncodeOpenChromFaireK562Sig.bigWig')),
389 | 'mnase':BigWigFile(open('large_data_files/wgEncodeSydhNsomeK562Sig.bigWig'))}
390 | ```
391 |
392 |
393 | ```python
394 | paramTable_trainingGuides = generateTypicalParamTable(libraryTable_training,sgInfoTable_training, tssTable, p1p2Table, genomeDict, bwhandleDict)
395 | ```
396 |
397 | ....
398 |
399 | /usr/local/lib/python2.7/dist-packages/numpy/lib/nanfunctions.py:675: RuntimeWarning: Mean of empty slice
400 | warnings.warn("Mean of empty slice", RuntimeWarning)
401 | /usr/local/lib/python2.7/dist-packages/numpy/lib/nanfunctions.py:326: RuntimeWarning: All-NaN slice encountered
402 | warnings.warn("All-NaN slice encountered", RuntimeWarning)
403 |
404 |
405 | .Done!
406 |
407 | ## Fit parameters
408 |
409 |
410 | ```python
411 | #load in the 5-fold cross-validation splits used to generate the model
412 | import cPickle
413 | with open('data_files/CRISPRi_trainingdata_traintestsets.txt') as infile:
414 | geneFold_train, geneFold_test, fitTable = cPickle.load(infile)
415 | ```
416 |
417 |
418 | ```python
419 | transformedParams_train, estimators = fitParams(paramTable_trainingGuides.loc[activityScores.dropna().index].iloc[geneFold_train], activityScores.loc[activityScores.dropna().index].iloc[geneFold_train], fitTable)
420 |
421 | transformedParams_test = transformParams(paramTable_trainingGuides.loc[activityScores.dropna().index].iloc[geneFold_test], fitTable, estimators)
422 |
423 | reg = linear_model.ElasticNetCV(l1_ratio=[.5, .75, .9, .99,1], n_jobs=16, max_iter=2000)
424 |
425 | scaler = preprocessing.StandardScaler()
426 | reg.fit(scaler.fit_transform(transformedParams_train), activityScores.loc[activityScores.dropna().index].iloc[geneFold_train])
427 | predictedScores = pd.Series(reg.predict(scaler.transform(transformedParams_test)), index=transformedParams_test.index)
428 | testScores = activityScores.loc[activityScores.dropna().index].iloc[geneFold_test]
429 |
430 | print 'Prediction AUC-ROC:', metrics.roc_auc_score((testScores >= .75).values, np.array(predictedScores.values,dtype='float64'))
431 | print 'Prediction R^2:', reg.score(scaler.transform(transformedParams_test), testScores)
432 | print 'Regression parameters:', reg.l1_ratio_, reg.alpha_
433 | coefs = pd.DataFrame(zip(*[abs(reg.coef_),reg.coef_]), index = transformedParams_test.columns, columns=['abs','true'])
434 | print 'Number of features used:', len(coefs) - sum(coefs['abs'] < .00000000001)
435 | ```
436 |
437 | ('distance', 'primary TSS-Up') {'C': 0.05, 'gamma': 0.0001}
438 | ('distance', 'primary TSS-Down') {'C': 0.5, 'gamma': 5e-05}
439 | ('distance', 'secondary TSS-Up') {'C': 0.1, 'gamma': 5e-05}
440 | ('distance', 'secondary TSS-Down') {'C': 0.1, 'gamma': 5e-05}
441 |
442 |
443 | /home/mhorlbeck/.local/lib/python2.7/site-packages/sklearn/utils/validation.py:323: UserWarning: StandardScaler assumes floating point values as input, got object
444 | "got %s" % (estimator, X.dtype))
445 | /home/mhorlbeck/.local/lib/python2.7/site-packages/sklearn/linear_model/base.py:400: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
446 | if precompute == 'auto':
447 |
448 |
449 | Prediction AUC-ROC: 0.803109696478
450 | Prediction R^2: 0.31263687609
451 | Regression parameters: 0.5 0.00534455043278
452 | Number of features used: 327
453 |
454 |
455 |
456 | ```python
457 | #can save state for reproducing estimators later
458 | #the pickling of the scikit-learn estimators/regressors will allow the model to be reloaded for prediction of other guide designs,
459 | # but will not be compatible across scikit-learn versions, so it is important to preserve the training data and training/test folds
460 | import cPickle
461 | estimatorString = cPickle.dumps((fitTable, estimators, scaler, reg))
462 | with open(PICKLE_FILE,'w') as outfile:
463 | outfile.write(estimatorString)
464 |
465 | #also save the transformed parameters as these can slightly differ based on the automated binning strategy
466 | transformedParams_train.head().to_csv(TRANSFORMED_PARAM_HEADER,sep='\t')
467 | ```
468 |
469 | # 2. Applying machine learning model to predict sgRNA activity
470 |
471 | ## Generate TSS annotation using FANTOM dataset
472 |
473 |
474 | ```python
475 | #you can supply any table of gene transcription start sites formatted as below
476 | #for demonstration purposes, the rest of this walkthrough will use a small arbitrary subset of the protein coding TSS table
477 | tssTable_new = tssTable.iloc[10:20, :-1]
478 | tssTable_new.head()
479 | ```
480 |
481 |
482 |
483 |
484 |
485 |
486 |
487 |
488 | | gene | transcripts | position | strand | chromosome |
489 | |---|---|---|---|---|
490 | | AADACL2 | all | 151451714 | + | chr3 |
491 | | AADAT | ENST00000337664.4 | 171011117 | - | chr4 |
492 | | AADAT | ENST00000337664.4,ENST00000509167.1,ENST00000353187.2 | 171011284 | - | chr4 |
493 | | AADAT | ENST00000509167.1,ENST00000515480.1,ENST00000353187.2 | 171011424 | - | chr4 |
494 | | AAED1 | all | 99417562 | - | chr9 |
535 |
536 |
537 |
538 |
539 |
540 |
541 |
542 |
543 | ```python
544 | #if desired, use the ensembl annotation and the HGNC database to supply gene aliases to assist P1P2 matching in the next step
545 | gencodeData = loadGencodeData('large_data_files/gencode.v19.annotation.gtf')
546 | geneToAliases = generateAliasDict('large_data_files/20150424_HGNC_symbols.txt',gencodeData)
547 | ```
548 |
549 | Loading annotation file...Done
550 |
551 |
552 |
553 | ```python
554 | #Now create a TSS annotation by searching for P1 and P2 peaks near annotated TSSs
555 | #same parameters as for our lncRNA libraries
556 | p1p2Table_new = generateTssTable_P1P2strategy(tssTable_new, 'large_data_files/TSS_human.sorted.bed.gz',
557 | matchedp1p2Window = 30000, #region around supplied TSS annotation to search for a FANTOM P1 or P2 peak that matches the gene name (or alias)
558 | anyp1p2Window = 500, #region around supplied TSS annotation to search for the nearest P1 or P2 peak
559 | anyPeakWindow = 200, #region around supplied TSS annotation to search for any CAGE peak
560 | minDistanceForTwoTSS = 1000, #If a P1 and P2 peak are found, maximum distance at which to combine into a single annotation (with primary/secondary TSS positions)
561 | aliasDict = geneToAliases[0])
562 | #the function will report some gene ID collisions due to aliases and redundancy in the genome, but will resolve these itself
563 | ```
564 |
565 |
566 | ```python
567 | p1p2Table_new.head()
568 | ```
569 |
570 |
571 |
572 |
573 |
574 |
575 |
576 |
577 | | gene | transcript | chromosome | strand | TSS source | primary TSS | secondary TSS |
578 | |---|---|---|---|---|---|---|
579 | | AADACL2 | P1P2 | chr3 | + | CAGE, matched peaks | (151451707, 151451722) | (151451707, 151451722) |
580 | | AADAT | P1P2 | chr4 | - | CAGE, matched peaks | (171011323, 171011408) | (171011084, 171011147) |
581 | | AAED1 | P1P2 | chr9 | - | CAGE, matched peaks | (99417562, 99417609) | (99417615, 99417622) |
582 | | AAGAB | P1P2 | chr15 | - | CAGE, matched peaks | (67546963, 67547024) | (67546963, 67547024) |
583 | | AAK1 | P1P2 | chr2 | - | CAGE, matched peaks | (69870747, 69870812) | (69870854, 69870878) |
640 |
641 |
642 |
643 |
644 |
645 |
646 |
647 |
648 | ```python
649 | p1p2Table_new.groupby('TSS source').agg(len).iloc[:,[2]]
650 | ```
651 |
652 |
653 |
654 |
655 |
656 |
657 |
658 |
659 | | TSS source | primary TSS |
660 | |---|---|
661 | | CAGE, matched peaks | 8 |
671 |
672 |
673 |
674 |
675 |
676 |
677 |
678 |
679 | ```python
680 | len(p1p2Table_new)
681 | ```
682 |
683 |
684 |
685 |
686 | 8
687 |
688 |
689 |
690 |
691 | ```python
692 | #save tables
693 | tssTable_new.to_csv(TSS_TABLE_PATH, sep='\t')
694 | p1p2Table_new.to_csv(P1P2_TABLE_PATH, sep='\t')
695 | ```
696 |
697 | ## Find all sgRNAs in genomic regions of interest
698 |
699 |
700 | ```python
701 | libraryTable_new, sgInfoTable_new = findAllGuides(p1p2Table_new, genomeDict,
702 | (-25,500)) #region around P1P2 TSSs to search for new sgRNAs; recommend -550,-25 for CRISPRa
703 | ```
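The window tuple is interpreted relative to the direction of transcription. A minimal standalone sketch of the strand-aware arithmetic, mirroring the coordinate math at the top of findAllGuides in sgRNA_learning.py (the example uses the AADACL2 annotation from above):

```python
#convert an (upstream, downstream) window around a P1P2 annotation into a genomic range,
#mirroring the strand-aware arithmetic at the top of findAllGuides
def windowToGenomicRange(primaryTSS, secondaryTSS, strand, rangeTup):
    tssPositions = list(primaryTSS) + list(secondaryTSS)
    if strand == '+':
        return (min(tssPositions) + rangeTup[0], max(tssPositions) + rangeTup[1])
    else:
        return (min(tssPositions) - rangeTup[1], max(tssPositions) - rangeTup[0])

print windowToGenomicRange((151451707, 151451722), (151451707, 151451722), '+', (-25, 500))
#(151451682, 151452222)
```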
704 |
705 |
706 | ```python
707 | len(libraryTable_new)
708 | ```
709 |
710 |
711 |
712 |
713 | 1125
714 |
715 |
716 |
717 |
718 | ```python
719 | libraryTable_new.head()
720 | ```
721 |
722 |
723 |
724 |
725 |
726 |
727 |
728 |
729 | | sgId | gene | transcripts | sequence | genomic sequence |
730 | |---|---|---|---|---|
731 | | AADACL2_+_151451720.23-P1P2 | AADACL2 | P1P2 | GTAGACTTGGGAACTCTCTC | CCTGAGAGAGTTCCCAAGTCTAC |
732 | | AADACL2_+_151451732.23-P1P2 | AADACL2 | P1P2 | GGTAGAGCAATTGTAGACTT | CCCAAGTCTACAATTGCTCTACT |
733 | | AADACL2_+_151451733.23-P1P2 | AADACL2 | P1P2 | GAGTAGAGCAATTGTAGACT | CCAAGTCTACAATTGCTCTACTA |
734 | | AADACL2_-_151451809.23-P1P2 | AADACL2 | P1P2 | GCTCAGTACTGTGAAGAAGC | TCTCAGTACTGTGAAGAAGCTGG |
735 | | AADACL2_-_151451816.23-P1P2 | AADACL2 | P1P2 | GCTGTGAAGAAGCTGGAAAA | ACTGTGAAGAAGCTGGAAAAAGG |
778 |
779 |
780 |
781 |
782 |
783 |
784 |
785 | ## Predicting sgRNA activity
786 |
787 |
788 | ```python
789 | #calculate parameters for new sgRNAs
790 | paramTable_new = generateTypicalParamTable(libraryTable_new, sgInfoTable_new, tssTable_new, p1p2Table_new, genomeDict, bwhandleDict)
791 | ```
792 |
793 | .....Done!
794 |
795 |
796 | ```python
797 | paramTable_new.head()
798 | ```
799 |
800 |
801 |
802 |
803 |
804 |
805 |
806 |
807 | | sgId | length | distance: primary TSS-Up | distance: primary TSS-Down | distance: secondary TSS-Up | distance: secondary TSS-Down | homopolymers: A | homopolymers: G | homopolymers: C | homopolymers: T | base fractions: A | ... | RNA folding-pairing, no scaffold: -12 | -11 | -10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 |
808 | |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
809 | | AADACL2_+_151451720.23-P1P2 | 20 | 13 | -2 | 13 | -2 | 2 | 3 | 1 | 2 | 0.20 | ... | True | True | False | False | False | False | True | True | True | True |
810 | | AADACL2_+_151451732.23-P1P2 | 20 | 25 | 10 | 25 | 10 | 2 | 2 | 1 | 2 | 0.30 | ... | False | False | False | False | False | False | False | False | False | False |
811 | | AADACL2_+_151451733.23-P1P2 | 20 | 26 | 11 | 26 | 11 | 2 | 1 | 1 | 2 | 0.35 | ... | False | False | False | False | False | False | False | False | False | False |
812 | | AADACL2_-_151451809.23-P1P2 | 20 | 102 | 87 | 102 | 87 | 2 | 1 | 1 | 1 | 0.30 | ... | False | True | True | True | False | False | False | False | False | False |
813 | | AADACL2_-_151451816.23-P1P2 | 20 | 109 | 94 | 109 | 94 | 4 | 2 | 1 | 1 | 0.40 | ... | True | True | True | False | False | False | False | False | False | False |
814 | 
815 | 5 rows × 808 columns
988 |
989 |
990 |
991 |
992 |
993 | ```python
994 | #if starting from a separate session from where you ran the sgRNA learning steps of Part 1, reload the following
995 | import cPickle
996 | with open(PICKLE_FILE) as infile:
997 | fitTable, estimators, scaler, reg = cPickle.load(infile)
998 |
999 | transformedParams_train = pd.read_csv(TRANSFORMED_PARAM_HEADER, sep='\t', header=[0,1], index_col=0) #two header rows to restore the MultiIndex columns
1000 | ```
1001 |
1002 |
1003 | ```python
1004 | #transform and predict scores according to sgRNA prediction model
1005 | transformedParams_new = transformParams(paramTable_new, fitTable, estimators)
1006 |
1007 | #reconcile any differences in column headers generated by automated binning
1008 | colTups = []
1009 | for (l1, l2), col in transformedParams_new.iteritems():
1010 | colTups.append((l1,str(l2)))
1011 | transformedParams_new.columns = pd.MultiIndex.from_tuples(colTups)
1012 |
1013 | predictedScores_new = pd.Series(reg.predict(scaler.transform(transformedParams_new.loc[:, transformedParams_train.columns].fillna(0).values)), index=transformedParams_new.index)
1014 | ```
1015 |
1016 |
1017 | ```python
1018 | predictedScores_new.head()
1019 | ```
1020 |
1021 |
1022 |
1023 |
1024 | sgId
1025 | AADACL2_+_151451720.23-P1P2 0.641245
1026 | AADACL2_+_151451732.23-P1P2 0.693926
1027 | AADACL2_+_151451733.23-P1P2 0.655759
1028 | AADACL2_-_151451809.23-P1P2 0.500835
1029 | AADACL2_-_151451816.23-P1P2 0.434376
1030 | dtype: float64
1031 |
1032 |
1033 |
1034 |
1035 | ```python
1036 | libraryTable_new.to_csv(LIBRARY_TABLE_PATH,sep='\t')
1037 | sgInfoTable_new.to_csv(sgRNA_INFO_PATH,sep='\t')
1038 | predictedScores_new.to_csv(PREDICTED_SCORES_PATH, sep='\t')
1039 | ```
1040 |
1041 | # 3. Construct sgRNA libraries
1042 | ## Score sgRNAs for off-target potential
1043 |
1044 | ### There are many ways to score sgRNAs for off-target potential; below is one method that is simple and flexible, but it ignores gapped alignments and alternate PAMs, and relies on bowtie, which may not be maximally sensitive in all cases
1045 |
1046 |
1047 | ```python
1048 | !mkdir temp_bowtie_files
1049 | ```
1050 |
1051 |
1052 | ```python
1053 | #output all sequences to a temporary FASTQ file for running bowtie alignment
1054 | fqFile = 'temp_bowtie_files/bowtie_input.fq'
1055 |
1056 | def outputTempBowtieFastq(libraryTable, outputFileName):
1057 | phredString = 'I4!=======44444+++++++' #weighting for how impactful mismatches are along sgRNA sequence
1058 | with open(outputFileName,'w') as outfile:
1059 | for name, row in libraryTable.iterrows():
1060 | outfile.write('@' + name + '\n')
1061 | outfile.write('CCN' + str(Seq.Seq(row['sequence'][1:]).reverse_complement()) + '\n')
1062 | outfile.write('+\n')
1063 | outfile.write(phredString + '\n')
1064 |
1065 | outputTempBowtieFastq(libraryTable_new, fqFile)
1066 | ```
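For reference, the record this writes for the first AADACL2 sgRNA from the table above is shown below; the read is 'CCN' plus the reverse complement of all but the first base of the protospacer, so the quality string weights mismatches from PAM-proximal (most impactful) to PAM-distal:

    @AADACL2_+_151451720.23-P1P2
    CCNGAGAGAGTTCCCAAGTCTA
    +
    I4!=======44444+++++++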
1067 |
1068 |
1069 | ```python
1070 | import subprocess
1071 |
1072 | #specifying a list of parameters to run bowtie with
1073 | #each tuple contains
1074 | # *the mismatch threshold below which a site is considered a potential off-target (higher is more stringent)
1075 | # *the number of sites allowed (1 is minimum since each sgRNA should have one true site in genome)
1076 | # *the genome index against which to align the sgRNA sequences; these can be custom built to only consider sites near TSSs
1077 | # *a name for the bowtie run to create appropriately named output files
1078 | alignmentList = [(39,1,'large_data_files/hg19.ensemblTSSflank500b','39_nearTSS'),
1079 | (31,1,'large_data_files/hg19.ensemblTSSflank500b','31_nearTSS'),
1080 | (21,1,'large_data_files/hg19_maskChrMandPAR','21_genome'),
1081 | (31,2,'large_data_files/hg19.ensemblTSSflank500b','31_2_nearTSS'),
1082 | (31,3,'large_data_files/hg19.ensemblTSSflank500b','31_3_nearTSS')]
1083 |
1084 | alignmentColumns = []
1085 | for btThreshold, mflag, bowtieIndex, runname in alignmentList:
1086 |
1087 | alignedFile = 'temp_bowtie_files/' + runname + '_aligned.txt'
1088 | unalignedFile = 'temp_bowtie_files/' + runname + '_unaligned.fq'
1089 | maxFile = 'temp_bowtie_files/' + runname + '_max.fq'
1090 |
1091 | bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile
1092 | print bowtieString
1093 | print subprocess.call(bowtieString, shell=True) #0 means finished without errors
1094 |
1095 |     #parse through the file of sgRNAs that exceeded "m", the maximum allowable alignments, and mark as "True" any that are found
1096 | try:
1097 | with open(maxFile) as infile:
1098 | sgsAligning = set()
1099 | for i, line in enumerate(infile):
1100 | if i%4 == 0: #id line
1101 | sgsAligning.add(line.strip()[1:])
1102 | except IOError: #no sgRNAs exceeded m, so no maxFile created
1103 | sgsAligning = set()
1104 |
1105 | alignmentColumns.append(libraryTable_new.apply(lambda row: row.name in sgsAligning, axis=1))
1106 |
1107 | #collate results into a table, and flip the boolean values to yield the sgRNAs that passed filter as True
1108 | alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3]).ne(True)
1109 | ```
1110 |
1111 | bowtie -n 3 -l 15 -e 39 -m 1 --nomaqround -a --tryhard -p 16 --chunkmbs 256 large_data_files/hg19.ensemblTSSflank500b --suppress 5,6,7 --un temp_bowtie_files/39_nearTSS_unaligned.fq --max temp_bowtie_files/39_nearTSS_max.fq -q temp_bowtie_files/bowtie_input.fq temp_bowtie_files/39_nearTSS_aligned.txt
1112 | 0
1113 | bowtie -n 3 -l 15 -e 31 -m 1 --nomaqround -a --tryhard -p 16 --chunkmbs 256 large_data_files/hg19.ensemblTSSflank500b --suppress 5,6,7 --un temp_bowtie_files/31_nearTSS_unaligned.fq --max temp_bowtie_files/31_nearTSS_max.fq -q temp_bowtie_files/bowtie_input.fq temp_bowtie_files/31_nearTSS_aligned.txt
1114 | 0
1115 | bowtie -n 3 -l 15 -e 21 -m 1 --nomaqround -a --tryhard -p 16 --chunkmbs 256 large_data_files/hg19_maskChrMandPAR --suppress 5,6,7 --un temp_bowtie_files/21_genome_unaligned.fq --max temp_bowtie_files/21_genome_max.fq -q temp_bowtie_files/bowtie_input.fq temp_bowtie_files/21_genome_aligned.txt
1116 | 0
1117 | bowtie -n 3 -l 15 -e 31 -m 2 --nomaqround -a --tryhard -p 16 --chunkmbs 256 large_data_files/hg19.ensemblTSSflank500b --suppress 5,6,7 --un temp_bowtie_files/31_2_nearTSS_unaligned.fq --max temp_bowtie_files/31_2_nearTSS_max.fq -q temp_bowtie_files/bowtie_input.fq temp_bowtie_files/31_2_nearTSS_aligned.txt
1118 | 0
1119 | bowtie -n 3 -l 15 -e 31 -m 3 --nomaqround -a --tryhard -p 16 --chunkmbs 256 large_data_files/hg19.ensemblTSSflank500b --suppress 5,6,7 --un temp_bowtie_files/31_3_nearTSS_unaligned.fq --max temp_bowtie_files/31_3_nearTSS_max.fq -q temp_bowtie_files/bowtie_input.fq temp_bowtie_files/31_3_nearTSS_aligned.txt
1120 | 0
1121 |
1122 |
1123 |
1124 | ```python
1125 | alignmentTable.head() #True = passed threshold
1126 | ```
1127 |
1128 |
1129 |
1130 |
1131 |
1132 |
1133 |
1134 |
1135 | | sgId | 39_nearTSS | 31_nearTSS | 21_genome | 31_2_nearTSS | 31_3_nearTSS |
1136 | |---|---|---|---|---|---|
1137 | | AADACL2_+_151451720.23-P1P2 | True | True | False | True | True |
1138 | | AADACL2_+_151451732.23-P1P2 | True | True | True | True | True |
1139 | | AADACL2_+_151451733.23-P1P2 | True | True | True | True | True |
1140 | | AADACL2_-_151451809.23-P1P2 | False | True | False | True | True |
1141 | | AADACL2_-_151451816.23-P1P2 | True | True | False | True | True |
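To summarize pass rates, a quick sketch using the boolean table above:

```python
#number of sgRNAs passing each individual filter, and passing all filters at once
print alignmentTable.sum()
print alignmentTable.all(axis=1).sum()
```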
1191 |
1192 |
1193 |
1194 |
1195 |
1196 |
1197 |
1198 | ## Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering
1199 |
1200 |
1201 | ```python
1202 | #combine all generated data into one master table
1203 | predictedScores_new.name = 'predicted score'
1204 | v2Table = pd.concat((libraryTable_new, predictedScores_new, alignmentTable, sgInfoTable_new), axis=1, keys=['library table v2', 'predicted score', 'off-target filters', 'sgRNA info'])
1205 | ```
1206 |
1207 |
1208 | ```python
1209 | import re
1210 | #for our pCRISPRi/a-v2 vector, we append flanking sequences to each sgRNA sequence for cloning and require the oligo to contain
1211 | #exactly 1 BstXI and BlpI site each for cloning, and exactly 0 SbfI sites for sequencing sample preparation
1212 | restrictionSites = {re.compile('CCA......TGG'):1, #BstXI
1213 |                     re.compile('GCT.AGC'):1,      #BlpI
1214 |                     re.compile('CCTGCAGG'):0}     #SbfI
1215 |
1216 | def matchREsites(sequence, REdict):
1217 | seq = sequence.upper()
1218 |     for resite, numMatchesExpected in REdict.iteritems():
1219 | if len(resite.findall(seq)) != numMatchesExpected:
1220 | return False
1221 |
1222 | return True
1223 |
1224 | def checkOverlaps(leftPosition, acceptedLeftPositions, nonoverlapMin):
1225 | for pos in acceptedLeftPositions:
1226 | if abs(pos - leftPosition) < nonoverlapMin:
1227 | return False
1228 | return True
1229 | ```
1230 |
1231 |
1232 | ```python
1233 | #flanking sequences
1234 | upstreamConstant = 'CCACCTTGTTG'
1235 | downstreamConstant = 'GTTTAAGAGCTAAGCTG'
1236 |
1237 | #minimum spacing (bp) between the left-most positions of any two sgRNAs accepted for the same TSS
1238 | nonoverlapMin = 3
1239 |
1240 | #number of sgRNAs to pick per gene/TSS
1241 | sgRNAsToPick = 10
1242 |
1243 | #list of off-target filter (or combinations of filters) levels, matching the names in the alignment table above
1244 | offTargetLevels = [['31_nearTSS', '21_genome'],
1245 | ['31_nearTSS'],
1246 | ['21_genome'],
1247 | ['31_2_nearTSS'],
1248 | ['31_3_nearTSS']]
1249 |
1250 | #for each gene/TSS, go through each sgRNA in descending order of predicted score
1251 | #if an sgRNA passes the restriction site, overlap, and off-target filters, accept it into the library
1252 | #if the number of sgRNAs accepted is less than sgRNAsToPick, reduce off-target stringency by one and continue
1253 | v2Groups = v2Table.groupby([('library table v2','gene'),('library table v2','transcripts')])
1254 | newSgIds = []
1255 | unfinishedTss = []
1256 | for (gene, transcript), group in v2Groups:
1257 | geneSgIds = []
1258 | geneLeftPositions = []
1259 | empiricalSgIds = dict()
1260 |
1261 | stringency = 0
1262 |
1263 | while len(geneSgIds) < sgRNAsToPick and stringency < len(offTargetLevels):
1264 | for sgId_v2, row in group.sort_values(('predicted score','predicted score'), ascending=False).iterrows():
1265 | oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant
1266 | leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)
1267 | if len(geneSgIds) < sgRNAsToPick and row['off-target filters'].loc[offTargetLevels[stringency]].all() \
1268 | and matchREsites(oligoSeq, restrictionSites) \
1269 | and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):
1270 | geneSgIds.append((sgId_v2,
1271 | gene,transcript,
1272 | row[('library table v2','sequence')], oligoSeq,
1273 | row[('predicted score','predicted score')], np.nan,
1274 | stringency))
1275 | geneLeftPositions.append(leftPos)
1276 |
1277 | stringency += 1
1278 |
1279 | if len(geneSgIds) < sgRNAsToPick:
1280 | unfinishedTss.append((gene, transcript)) #if the number of accepted sgRNAs is still less than sgRNAsToPick, discard gene
1281 | else:
1282 | newSgIds.extend(geneSgIds)
1283 |
1284 | libraryTable_complete = pd.DataFrame(newSgIds, columns = ['sgID', 'gene', 'transcript','protospacer sequence', 'oligo sequence',
1285 | 'predicted score', 'empirical score', 'off-target stringency']).set_index('sgID')
1286 | ```
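A few sanity checks on the filters (the protospacer is the top AADACL2 sgRNA from the library tables; the positions passed to checkOverlaps are illustrative):

```python
#assemble and check an example oligo
exampleOligo = upstreamConstant + 'GGTAGAGCAATTGTAGACTT' + downstreamConstant
print exampleOligo                                 #CCACCTTGTTGGGTAGAGCAATTGTAGACTTGTTTAAGAGCTAAGCTG
print matchREsites(exampleOligo, restrictionSites) #True: 1 BstXI, 1 BlpI, 0 SbfI sites
print checkOverlaps(151451709, [151451697], nonoverlapMin) #True: left positions 12bp apart
```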
1287 |
1288 |
1289 | ```python
1290 | print len(libraryTable_complete)
1291 | ```
1292 |
1293 | 80
1294 |
1295 |
1296 |
1297 | ```python
1298 | #number of sgRNAs accepted at each stringency level
1299 | libraryTable_complete.groupby('off-target stringency').agg(len).iloc[:,0]
1300 | ```
1301 |
1302 |
1303 |
1304 |
1305 | off-target stringency
1306 | 0 80
1307 | Name: gene, dtype: int64
1308 |
1309 |
1310 |
1311 |
1312 | ```python
1313 | #number of TSSs with fewer than required number of sgRNAs (and thus not included in the library)
1314 | print len(unfinishedTss)
1315 | ```
1316 |
1317 | 0
1318 |
1319 |
1320 |
1321 | ```python
1322 | libraryTable_complete.head()
1323 | ```
1324 |
1325 |
1326 |
1327 |
1328 |
1329 |
1330 |
1331 |
1332 | | sgID | gene | transcript | protospacer sequence | oligo sequence | predicted score | empirical score | off-target stringency |
1333 | |---|---|---|---|---|---|---|---|
1334 | | AADACL2_+_151451732.23-P1P2 | AADACL2 | P1P2 | GGTAGAGCAATTGTAGACTT | CCACCTTGTTGGGTAGAGCAATTGTAGACTTGTTTAAGAGCTAAGCTG | 0.693926 | NaN | 0 |
1335 | | AADACL2_-_151452019.23-P1P2 | AADACL2 | P1P2 | GATGACTTATTGACTAAAAA | CCACCTTGTTGGATGACTTATTGACTAAAAAGTTTAAGAGCTAAGCTG | 0.451392 | NaN | 0 |
1336 | | AADACL2_+_151452121.23-P1P2 | AADACL2 | P1P2 | GACTGTTACTCACAGATATA | CCACCTTGTTGGACTGTTACTCACAGATATAGTTTAAGAGCTAAGCTG | 0.426695 | NaN | 0 |
1337 | | AADACL2_-_151451828.23-P1P2 | AADACL2 | P1P2 | GTGGAAAAAGGGATATTATG | CCACCTTGTTGGTGGAAAAAGGGATATTATGGTTTAAGAGCTAAGCTG | 0.404655 | NaN | 0 |
1338 | | AADACL2_-_151451931.23-P1P2 | AADACL2 | P1P2 | GAGCTGGAAAATAATGGCCT | CCACCTTGTTGGAGCTGGAAAATAATGGCCTGTTTAAGAGCTAAGCTG | 0.404269 | NaN | 0 |
1402 |
1403 |
1404 |
1405 |
1406 |
1407 |
1408 |
1409 | ## Design negative controls matching the base composition of the library
1410 |
1411 |
1412 | ```python
1413 | #calculate the base frequency at each position of the sgRNA, then generate random sequences weighted by this frequency
1414 | def getBaseFrequencies(libraryTable, baseConversion = {'G':0, 'C':1, 'T':2, 'A':3}):
1415 | baseArray = np.zeros((len(libraryTable),20))
1416 |
1417 | for i, (index, seq) in enumerate(libraryTable['protospacer sequence'].iteritems()):
1418 | for j, char in enumerate(seq.upper()):
1419 | baseArray[i,j] = baseConversion[char]
1420 |
1421 | baseTable = pd.DataFrame(baseArray, index = libraryTable.index)
1422 |
1423 | baseFrequencies = baseTable.apply(lambda col: col.groupby(col).agg(len)).fillna(0) / len(baseTable)
1424 | baseFrequencies.index = ['G','C','T','A']
1425 |
1426 | baseCumulativeFrequencies = baseFrequencies.copy()
1427 | baseCumulativeFrequencies.loc['C'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C']
1428 | baseCumulativeFrequencies.loc['T'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T']
1429 | baseCumulativeFrequencies.loc['A'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T'] + baseFrequencies.loc['A']
1430 |
1431 | return baseFrequencies, baseCumulativeFrequencies
1432 |
1433 | def generateRandomSequence(baseCumulativeFrequencies):
1434 | randArray = np.random.random(baseCumulativeFrequencies.shape[1])
1435 |
1436 | seq = []
1437 | for i, col in baseCumulativeFrequencies.iteritems():
1438 | for base, freq in col.iteritems():
1439 | if randArray[i] < freq:
1440 | seq.append(base)
1441 | break
1442 |
1443 | return ''.join(seq)
1444 | ```
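As a quick illustration of the sampling (output varies from run to run):

```python
#draw a single random 20nt protospacer weighted by the library's positional base frequencies
print generateRandomSequence(getBaseFrequencies(libraryTable_complete)[1])
```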
1445 |
1446 |
1447 | ```python
1448 | baseCumulativeFrequencies = getBaseFrequencies(libraryTable_complete)[1]
1449 | negList = []
1450 | numberToGenerate = 1000 #can generate many more; some will be filtered out by off-targets, and you can always select an arbitrary subset for inclusion into the library
1451 | for i in range(numberToGenerate):
1452 | negList.append(generateRandomSequence(baseCumulativeFrequencies))
1453 | negTable = pd.DataFrame(negList, index=['non-targeting_' + str(i) for i in range(numberToGenerate)], columns = ['sequence'])
1454 |
1455 | fqFile = 'temp_bowtie_files/bowtie_input_negs.fq'
1456 | outputTempBowtieFastq(negTable, fqFile)
1457 | ```
1458 |
1459 |
1460 | ```python
1461 | #similar to targeting sgRNA off-target scoring, but looking for sgRNAs with 0 alignments
1462 | alignmentList = [(31,1,'~/indices/hg19.ensemblTSSflank500b','31_nearTSS_negs'),
1463 | (21,1,'~/indices/hg19_maskChrMandPAR','21_genome_negs')]
1464 |
1465 | alignmentColumns = []
1466 | for btThreshold, mflag, bowtieIndex, runname in alignmentList:
1467 |
1468 | alignedFile = 'temp_bowtie_files/' + runname + '_aligned.txt'
1469 | unalignedFile = 'temp_bowtie_files/' + runname + '_unaligned.fq'
1470 | maxFile = 'temp_bowtie_files/' + runname + '_max.fq'
1471 |
1472 | bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile
1473 | print bowtieString
1474 | print subprocess.call(bowtieString, shell=True)
1475 |
1476 |     #for negative controls, parse the unaligned file (sgRNAs with zero alignments), so the booleans are not flipped here
1477 | with open(unalignedFile) as infile:
1478 | sgsAligning = set()
1479 | for i, line in enumerate(infile):
1480 | if i%4 == 0: #id line
1481 | sgsAligning.add(line.strip()[1:])
1482 |
1483 | alignmentColumns.append(negTable.apply(lambda row: row.name in sgsAligning, axis=1))
1484 |
1485 | alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3])
1486 | alignmentTable.head()
1487 | ```
1488 |
1489 | bowtie -n 3 -l 15 -e 31 -m 1 --nomaqround -a --tryhard -p 16 --chunkmbs 256 ~/indices/hg19.ensemblTSSflank500b --suppress 5,6,7 --un temp_bowtie_files/31_nearTSS_negs_unaligned.fq --max temp_bowtie_files/31_nearTSS_negs_max.fq -q temp_bowtie_files/bowtie_input_negs.fq temp_bowtie_files/31_nearTSS_negs_aligned.txt
1490 | 0
1491 | bowtie -n 3 -l 15 -e 21 -m 1 --nomaqround -a --tryhard -p 16 --chunkmbs 256 ~/indices/hg19_maskChrMandPAR --suppress 5,6,7 --un temp_bowtie_files/21_genome_negs_unaligned.fq --max temp_bowtie_files/21_genome_negs_max.fq -q temp_bowtie_files/bowtie_input_negs.fq temp_bowtie_files/21_genome_negs_aligned.txt
1492 | 0
1493 |
1494 |
1495 |
1496 |
1497 |
1498 |
1499 |
1500 |
1501 |
1502 | |  | 31_nearTSS_negs | 21_genome_negs |
1503 | |---|---|---|
1504 | | non-targeting_0 | True | True |
1505 | | non-targeting_1 | True | True |
1506 | | non-targeting_2 | True | True |
1507 | | non-targeting_3 | True | False |
1508 | | non-targeting_4 | True | True |
1532 |
1533 |
1534 |
1535 |
1536 |
1537 |
1538 |
1539 |
1540 | ```python
1541 | acceptedNegList = []
1542 | negCount = 0
1543 | for i, (name, row) in enumerate(pd.concat((negTable,alignmentTable),axis=1, keys=['seq','alignment']).iterrows()):
1544 | oligo = upstreamConstant + row['seq','sequence'] + downstreamConstant
1545 | if row['alignment'].all() and matchREsites(oligo, restrictionSites):
1546 | acceptedNegList.append(('non-targeting_%05d' % negCount, 'negative_control', 'na', row['seq','sequence'], oligo, 0))
1547 | negCount += 1
1548 |
1549 | acceptedNegs = pd.DataFrame(acceptedNegList, columns = ['sgId', 'gene', 'transcript', 'protospacer sequence', 'oligo sequence', 'off-target stringency']).set_index('sgId')
1550 | ```
1551 |
1552 |
1553 | ```python
1554 | acceptedNegs.head()
1555 | ```
1556 |
1557 |
1558 |
1559 |
1560 |
1561 |
1562 |
1563 |
1564 | | sgId | gene | transcript | protospacer sequence | oligo sequence | off-target stringency |
1565 | |---|---|---|---|---|---|
1566 | | non-targeting_00000 | negative_control | na | GGTTTCGCGCGCTTACAGAT | CCACCTTGTTGGGTTTCGCGCGCTTACAGATGTTTAAGAGCTAAGCTG | 0 |
1567 | | non-targeting_00001 | negative_control | na | GGTGGTCGAAGATAGCGAGC | CCACCTTGTTGGGTGGTCGAAGATAGCGAGCGTTTAAGAGCTAAGCTG | 0 |
1568 | | non-targeting_00002 | negative_control | na | GCTCTTGACAAATTCAAGCT | CCACCTTGTTGGCTCTTGACAAATTCAAGCTGTTTAAGAGCTAAGCTG | 0 |
1569 | | non-targeting_00003 | negative_control | na | GGCCGGGAGAGCGGGAACTC | CCACCTTGTTGGGCCGGGAGAGCGGGAACTCGTTTAAGAGCTAAGCTG | 0 |
1570 | | non-targeting_00004 | negative_control | na | GTCGCAAGCCGGGGTAGGGT | CCACCTTGTTGGTCGCAAGCCGGGGTAGGGTGTTTAAGAGCTAAGCTG | 0 |
1620 |
1621 |
1622 |
1623 |
1624 |
1625 |
1626 |
1627 |
1628 | ```python
1629 | libraryTable_complete.to_csv(LIBRARY_WITHOUT_NEGATIVES_PATH, sep='\t')
1630 | acceptedNegs.to_csv(NEGATIVE_CONTROLS_PATH,sep='\t')
1631 | ```
1632 |
1633 | ## Finalizing library design
1634 |
1635 | * divide genes into sublibrary groups (if required)
1636 | * assign negative control sgRNAs to sublibrary groups; ~1-2% of the number of sgRNAs in the library is a good rule-of-thumb
1637 | * append PCR adapter sequences (~18bp) to each end of the oligo sequences to enable amplification of the oligo pool; each sublibrary should have an orthogonal sequence so they can be cloned separately (a minimal sketch follows below)
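A minimal sketch of the adapter-appending step; the adapter sequences and sublibrary assignment below are hypothetical placeholders for illustration, not the published v2 adapters:

```python
#append sublibrary-specific PCR adapter sequences to each oligo
#the ~18bp adapters below are hypothetical placeholders, not the published v2 set
pcrAdapters = {'sublibrary_1': ('GTGTAACCCGTAGGGCAC', 'CCTTAGGTGTGCAGCGTC')}

fivePrime, threePrime = pcrAdapters['sublibrary_1']
finalOligos = libraryTable_complete['oligo sequence'].apply(lambda oligo: fivePrime + oligo + threePrime)
```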
1638 |
1639 |
1640 | ```python
1641 |
1642 | ```
1643 |
--------------------------------------------------------------------------------