├── .gitignore
├── README.md
├── CRISPRiaDesign_example_notebook.md
├── CRISPRiaDesign_example_notebook.ipynb
├── sgRNA_learning.py
└── Library_design_walkthrough.md
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | #files too large for github
7 | large_data_files/
8 |
9 | # C extensions
10 | *.so
11 |
12 | # Distribution / packaging
13 | .Python
14 | env/
15 | build/
16 | develop-eggs/
17 | dist/
18 | downloads/
19 | eggs/
20 | .eggs/
21 | lib/
22 | lib64/
23 | parts/
24 | sdist/
25 | var/
26 | *.egg-info/
27 | .installed.cfg
28 | *.egg
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *.cover
49 | .hypothesis/
50 |
51 | # Translations
52 | *.mo
53 | *.pot
54 |
55 | # Django stuff:
56 | *.log
57 | local_settings.py
58 |
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 |
63 | # Scrapy stuff:
64 | .scrapy
65 |
66 | # Sphinx documentation
67 | docs/_build/
68 |
69 | # PyBuilder
70 | target/
71 |
72 | # IPython Notebook
73 | .ipynb_checkpoints
74 |
75 | # pyenv
76 | .python-version
77 |
78 | # celery beat schedule file
79 | celerybeat-schedule
80 |
81 | # dotenv
82 | .env
83 |
84 | # virtualenv
85 | venv/
86 | ENV/
87 |
88 | # Spyder project settings
89 | .spyderproject
90 |
91 | # Rope project settings
92 | .ropeproject
93 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CRISPRiaDesign
2 |
3 | This site hosts the sgRNA machine learning scripts used to generate the Weissman lab's next-generation CRISPRi and CRISPRa library designs [(Horlbeck et al., eLife 2016)](https://elifesciences.org/content/5/e19760). These are currently implemented as interactive scripts along with Jupyter (IPython) notebooks providing step-by-step instructions for creating new sgRNA libraries. Future plans include adding command line functions to make library design more user-friendly. Note that all sgRNA designs for the CRISPRi/a human/mouse protein-coding gene libraries are included as supplementary tables in the eLife paper, so to clone individual sgRNAs or construct custom sublibraries targeting protein-coding genes, simply refer to those tables. These scripts are primarily useful for designing sgRNAs against novel or non-coding genes, or for organisms other than human and mouse.
4 |
5 | **To apply the exact quantitative models used to generate the CRISPRi-v2 or CRISPRa-v2 libraries**, follow the steps outlined in the Library_design_walkthrough (included as a Jupyter notebook or [web page](Library_design_walkthrough.md)).
6 |
7 | To see full example code for de novo machine learning, prediction of sgRNA activity for desired loci, and construction of new genome-scale CRISPRi/a libraries, see the CRISPRiaDesign_example_notebook (included as a Jupyter notebook or [web page](CRISPRiaDesign_example_notebook.md)).
8 |
9 | ### Dependencies
10 | * Python v2.7
11 | * Jupyter notebook
12 | * Biopython
13 | * Scipy/Numpy/Pandas
14 | * Scikit-learn
15 | * bx-python (v0.5.0, https://github.com/bxlab/bx-python)
16 | * Pysam
17 | * [ScreenProcessing](https://github.com/mhorlbeck/ScreenProcessing)
18 |
19 | External command line applications required:
20 | * ViennaRNA
21 | * Bowtie (not Bowtie2)
22 |
23 | Large genomic data files required:
24 |
25 | Links are to the human genome files used for the hCRISPRi-v2 and hCRISPRa-v2 machine learning (and required for the Library_design_walkthrough), but any organism/assembly may be used for the design of new libraries or de novo machine learning. For convenience, **the files referenced in the Library_design_walkthrough (in the folder "large_data_files") are also available [here](https://ucsf.box.com/s/s4ds471in2ngjer7okavzf5cqf2ebrqj)**.
26 |
27 | * Genome sequence as FASTA ([hg19](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/))
28 | * FANTOM5 TSS annotation as BED ([TSS_human](http://fantom.gsc.riken.jp/5/datafiles/phase1.3/extra/TSS_classifier/))
29 | * Chromatin data as BigWig ([MNase](https://www.encodeproject.org/files/ENCFF000VNN/), [DNase](https://www.encodeproject.org/files/ENCFF000SVL/), [FAIRE-seq](https://www.encodeproject.org/files/ENCFF000TLU/))
30 | * HGNC table of gene aliases (not strictly required for the Library_design_walkthrough but useful in some steps)
31 | * Ensembl annotation as GTF (not strictly required for the Library_design_walkthrough but useful in some steps and in other functions; release 74 used for the published library designs)
32 |
--------------------------------------------------------------------------------
/CRISPRiaDesign_example_notebook.md:
--------------------------------------------------------------------------------
1 |
2 | 1. Learning sgRNA predictors from empirical data
3 | * Load scripts and empirical data
4 | * Generate TSS annotation using FANTOM dataset
5 | * Calculate parameters for empirical sgRNAs
6 | * Fit parameters
7 | 2. Applying machine learning model to predict sgRNA activity
8 | * Find all sgRNAs in genomic regions of interest
9 | * Predicting sgRNA activity
10 | 3. Construct sgRNA libraries
11 | * Score sgRNAs for off-target potential
12 | * Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering
13 | * Design negative controls matching the base composition of the library
14 | * Finalizing library design
15 |
16 | # 1. Learning sgRNA predictors from empirical data
17 | ## Load scripts and empirical data
18 |
19 |
20 | ```python
21 | %run sgRNA_learning.py
22 | ```
23 |
24 |
25 | ```python
26 | genomeDict = loadGenomeAsDict(FASTA_FILE_OF_GENOME)
27 | gencodeData = loadGencodeData(GTF_FILE_FROM_GENCODE)
28 | ```
29 |
30 |
31 | ```python
32 | #load empirical data as tables in the format generated by github.com/mhorlbeck/ScreenProcessing
33 | libraryTable, phenotypeTable, geneTable = loadExperimentData(PATHS_TO_DATA_GENERATED_BY_ScreenProcessing)
34 | ```
35 |
36 |
37 | ```python
38 | #extract genes that scored as hits, normalize phenotypes, and extract information on sgRNAs from the sgIDs
39 | discriminantTable = calculateDiscriminantScores(geneTable)
40 | normedScores, maxDiscriminantTable = getNormalizedsgRNAsOverThresh(libraryTable, phenotypeTable, discriminantTable,
41 |                                                                     DISCRIMINANT_THRESHOLD_eg20,
42 | 3, transcripts=False)
43 |
44 | libraryTable_subset = libraryTable.loc[normedScores.dropna().index]
45 | sgInfoTable = parseAllSgIds(libraryTable_subset)
46 | ```
47 |
48 | ## Generate TSS annotation using FANTOM dataset
49 |
50 |
51 | ```python
52 | #first generates a table of TSS annotations
53 | #legacy function to make an intermediate table for the "P1P2" annotation strategy, will be replaced in future versions
54 | #TSS_TABLE_BASED_ON_ENSEMBL is table without headers with columns:
55 | #gene, transcript, chromosome, TSS coordinate, strand, annotation_source(optional)
56 | tssTable = generateTssTable(geneTable, TSS_TABLE_BASED_ON_ENSEMBL, FANTOM_TSS_ANNOTATION_BED, 200)
57 | ```
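
Before running the cell above, the expected file contents can be inspected with pandas; a minimal sketch (the file name is a hypothetical placeholder, and the column names are added here only for readability, since the file itself has no header row):

```python
#a minimal sketch, assuming a hypothetical tab-delimited file 'ensembl_tss_table.txt'
#with the six columns described above and no header row
import pandas as pd
tssCheck = pd.read_csv('ensembl_tss_table.txt', sep='\t', header=None,
                       names=['gene', 'transcript', 'chromosome', 'TSS coordinate', 'strand', 'annotation source'])
print tssCheck.head()
```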
58 |
59 |
60 | ```python
61 | #Now create a TSS annotation by searching for P1 and P2 peaks near annotated TSSs
62 | geneToAliases = generateAliasDict(HGNC_SYMBOL_LOOKUP_TABLE,gencodeData)
63 | p1p2Table = generateTssTable_P1P2strategy(tssTable.loc[tssTable.apply(lambda row: row.name[0][:6] != 'pseudo',axis=1)],
64 | FANTOM_TSS_ANNOTATION_BED,
65 | matchedp1p2Window = 30000, #region around supplied TSS annotation to search for a FANTOM P1 or P2 peak that matches the gene name (or alias)
66 | anyp1p2Window = 500, #region around supplied TSS annotation to search for the nearest P1 or P2 peak
67 | anyPeakWindow = 200, #region around supplied TSS annotation to search for any CAGE peak
68 | minDistanceForTwoTSS = 1000, #If a P1 and P2 peak are found, maximum distance at which to combine into a single annotation (with primary/secondary TSS positions)
69 | aliasDict = geneToAliases[0])
70 | #the function will report some collisions of IDs due to use of aliases and redundancy in genome, but will resolve these itself
71 | ```
72 |
73 |
74 | ```python
75 | #after saving the tables for downstream use (e.g. with to_csv), reload them:
76 | tssTable = pd.read_csv(TSS_TABLE_PATH,sep='\t', index_col=range(2))
77 | p1p2Table = pd.read_csv(P1P2_TABLE_PATH,sep='\t', header=0, index_col=range(2))
78 | ```
79 |
80 | ## Calculate parameters for empirical sgRNAs
81 |
82 |
83 | ```python
84 | #Load bigwig files for any chromatin data of interest
85 | bwhandleDict = {'dnase':BigWigFile(open('ENCODE_data/wgEncodeOpenChromDnaseK562BaseOverlapSignalV2.bigWig')),
86 | 'faire':BigWigFile(open('ENCODE_data/wgEncodeOpenChromFaireK562Sig.bigWig')),
87 | 'mnase':BigWigFile(open('ENCODE_data/wgEncodeSydhNsomeK562Sig.bigWig'))}
88 | ```
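
Each handle can be spot-checked directly; a minimal sketch using bx-python's `get_as_array` (the coordinates are arbitrary hg19 positions chosen for illustration):

```python
#a minimal sketch: query 1kb of MNase-seq signal; bx-python returns a numpy array
#of per-base values, with NaN wherever the bigwig has no data
import numpy as np
mnaseSignal = bwhandleDict['mnase'].get_as_array('chr1', 1000000, 1001000)
print np.nanmean(mnaseSignal)
```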
89 |
90 |
91 | ```python
92 | paramTable_trainingGuides = generateTypicalParamTable(libraryTable_subset,sgInfoTable, tssTable, p1p2Table, genomeDict, bwhandleDict)
93 | ```
94 |
95 | ## Fit parameters
96 |
97 |
98 | ```python
99 | #populate table of fitting parameters
100 | typeList = ['binnable_onehot',
101 | 'continuous', 'continuous', 'continuous', 'continuous',
102 | 'binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot',
103 | 'binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot',
104 | 'binary']
105 | typeList.extend(['binary']*160)
106 | typeList.extend(['binary']*(16*38))
107 | typeList.extend(['binnable_onehot']*3)
108 | typeList.extend(['binnable_onehot']*2)
109 | typeList.extend(['binary']*18)
110 | fitTable = pd.DataFrame(typeList, index=paramTable_trainingGuides.columns, columns=['type'])
111 | fitparams =[{'bin width':1, 'min edge data':50, 'bin function':np.median},
112 | {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},
113 | {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},
114 | {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},
115 | {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},
116 | {'bin width':1, 'min edge data':50, 'bin function':np.median},
117 | {'bin width':1, 'min edge data':50, 'bin function':np.median},
118 | {'bin width':1, 'min edge data':50, 'bin function':np.median},
119 | {'bin width':1, 'min edge data':50, 'bin function':np.median},
120 | {'bin width':.1, 'min edge data':50, 'bin function':np.median},
121 | {'bin width':.1, 'min edge data':50, 'bin function':np.median},
122 | {'bin width':.1, 'min edge data':50, 'bin function':np.median},
123 | {'bin width':.1, 'min edge data':50, 'bin function':np.median},
124 | {'bin width':.1, 'min edge data':50, 'bin function':np.median},
125 | {'bin width':.1, 'min edge data':50, 'bin function':np.median},
126 | {'bin width':.1, 'min edge data':50, 'bin function':np.median},dict()]
127 | fitparams.extend([dict()]*160)
128 | fitparams.extend([dict()]*(16*38))
129 | fitparams.extend([
130 | {'bin width':.15, 'min edge data':50, 'bin function':np.median},
131 | {'bin width':.15, 'min edge data':50, 'bin function':np.median},
132 | {'bin width':.15, 'min edge data':50, 'bin function':np.median}])
133 | fitparams.extend([
134 | {'bin width':2, 'min edge data':50, 'bin function':np.median},
135 | {'bin width':2, 'min edge data':50, 'bin function':np.median}])
136 | fitparams.extend([dict()]*18)
137 | fitTable['params'] = fitparams
138 | ```
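
The `type`/`params` entries must line up one-to-one with the columns of `paramTable_trainingGuides`; a quick consistency check before fitting:

```python
#sanity check: one fitting specification per parameter column
print len(typeList) == len(fitparams) == len(paramTable_trainingGuides.columns)
fitTable.head()
```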
139 |
140 |
141 | ```python
142 | #divide empirical data into n-folds for cross-validation
143 | geneFoldList = getGeneFolds(libraryTable_subset, 5, transcripts=False)
144 | ```
145 |
146 |
147 | ```python
148 | #for each fold, fit parameters to training folds and measure ROC on test fold
149 | coefs = []
150 | scoreTups = []
151 | transformedParamTups = []
152 |
153 | for geneFold_train, geneFold_test in geneFoldList:
154 |
155 | transformedParams_train, estimators = fitParams(paramTable_trainingGuides.loc[normedScores.dropna().index].iloc[geneFold_train], normedScores.loc[normedScores.dropna().index].iloc[geneFold_train], fitTable)
156 |
157 | transformedParams_test = transformParams(paramTable_trainingGuides.loc[normedScores.dropna().index].iloc[geneFold_test], fitTable, estimators)
158 |
159 | reg = linear_model.ElasticNetCV(l1_ratio=[.5, .75, .9, .99,1], n_jobs=16, max_iter=2000)
160 |
161 | scaler = preprocessing.StandardScaler()
162 | reg.fit(scaler.fit_transform(transformedParams_train), normedScores.loc[normedScores.dropna().index].iloc[geneFold_train])
163 | predictedScores = pd.Series(reg.predict(scaler.transform(transformedParams_test)), index=transformedParams_test.index)
164 | testScores = normedScores.loc[normedScores.dropna().index].iloc[geneFold_test]
165 |
166 | transformedParamTups.append((scaler.transform(transformedParams_train),scaler.transform(transformedParams_test)))
167 | scoreTups.append((testScores, predictedScores))
168 |
169 | print 'Prediction AUC-ROC:', metrics.roc_auc_score((testScores >= .75).values, np.array(predictedScores.values,dtype='float64'))
170 | print 'Prediction R^2:', reg.score(scaler.transform(transformedParams_test), testScores)
171 | print 'Regression parameters:', reg.l1_ratio_, reg.alpha_
172 | coefs.append(pd.DataFrame(zip(*[abs(reg.coef_),reg.coef_]), index = transformedParams_test.columns, columns=['abs','true']))
173 | print 'Number of features used:', len(coefs[-1]) - sum(coefs[-1]['abs'] < .00000000001)
174 | ```
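
Since `scoreTups` accumulates the (measured, predicted) score pairs for each test fold, overall cross-validated performance can be summarized after the loop; a minimal sketch using the same >= .75 activity threshold as the per-fold printout above:

```python
#average AUC-ROC across the cross-validation folds
foldAucs = [metrics.roc_auc_score((test >= .75).values, np.array(pred.values, dtype='float64'))
            for test, pred in scoreTups]
print 'Mean AUC-ROC across folds:', np.mean(foldAucs)
```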
175 |
176 |
177 | ```python
178 | #can select an arbitrary fold (as shown here simply the last one tested) to save state for reproducing estimators later
179 | #the pickling of the scikit-learn estimators/regressors will allow the model to be reloaded for prediction of other guide designs,
180 | # but will not be compatible across scikit-learn versions, so it is important to preserve the training data and training/test folds
181 | import cPickle
182 | estimatorString = cPickle.dumps((fitTable, estimators, scaler, reg, (geneFold_train, geneFold_test)))
183 | with open(PICKLE_FILE,'w') as outfile:
184 | outfile.write(estimatorString)
185 |
186 | #also save the transformed parameters as these can slightly differ based on the automated binning strategy
187 | transformedParams_train.head().to_csv(TRANSFORMED_PARAM_HEADER,sep='\t')
188 | ```
189 |
190 | # 2. Applying machine learning model to predict sgRNA activity
191 |
192 |
193 | ```python
194 | #starting from a new session for demonstration purposes:
195 | %run sgRNA_learning.py
196 | import cPickle
197 |
198 | #load tssTable, p1p2Table, genome sequence, chromatin data
199 | tssTable = pd.read_csv(TSS_TABLE_PATH,sep='\t', index_col=range(2))
200 |
201 | p1p2Table = pd.read_csv(P1P2_TABLE_PATH,sep='\t', header=0, index_col=range(2))
202 | p1p2Table['primary TSS'] = p1p2Table['primary TSS'].apply(lambda s: tuple(int(x) for x in s.strip('()').split(', ')))
203 | p1p2Table['secondary TSS'] = p1p2Table['secondary TSS'].apply(lambda s: tuple(int(x) for x in s.strip('()').split(', ')))
204 |
205 | genomeDict = loadGenomeAsDict(FASTA_FILE_OF_GENOME)
206 |
207 | bwhandleDict = {'dnase':BigWigFile(open('ENCODE_data/wgEncodeOpenChromDnaseK562BaseOverlapSignalV2.bigWig')),
208 | 'faire':BigWigFile(open('ENCODE_data/wgEncodeOpenChromFaireK562Sig.bigWig')),
209 | 'mnase':BigWigFile(open('ENCODE_data/wgEncodeSydhNsomeK562Sig.bigWig'))}
210 |
211 | #load sgRNA prediction model saved after the parameter fitting step
212 | with open(PICKLE_FILE) as infile:
213 | fitTable, estimators, scaler, reg, (geneFold_train, geneFold_test) = cPickle.load(infile)
214 |
215 | transformedParamHeader = pd.read_csv(TRANSFORMED_PARAM_HEADER,sep='\t')
216 | ```
217 |
218 | ## Find all sgRNAs in genomic regions of interest
219 |
220 |
221 | ```python
222 | #use the same p1p2Table as above or generate a new one for novel TSSs; the (-25,500) tuple sets the window (bp relative to each TSS) searched for sgRNAs
223 | libraryTable_new, sgInfoTable_new = findAllGuides(p1p2Table, genomeDict, (-25,500))
224 | ```
225 |
226 |
227 | ```python
228 | #alternately, load tables of sgRNAs to score:
229 | libraryTable_new = pd.read_csv(LIBRARY_TABLE_PATH,sep='\t',index_col=0)
230 | sgInfoTable_new = pd.read_csv(SGINFO_TABLE_PATH,sep='\t',index_col=0)
231 | ```
232 |
233 | ## Predicting sgRNA activity
234 |
235 |
236 | ```python
237 | #calculate parameters for new sgRNAs
238 | paramTable_new = generateTypicalParamTable(libraryTable_new, sgInfoTable_new, tssTable, p1p2Table, genomeDict, bwhandleDict)
239 | ```
240 |
241 |
242 | ```python
243 | #transform and predict scores according to sgRNA prediction model
244 | transformedParams_new = transformParams(paramTable_new, fitTable, estimators)
245 |
246 | #reconcile any differences in column headers generated by automated binning
247 | colTups = []
248 | for (l1, l2), col in transformedParams_new.iteritems():
249 | colTups.append((l1,str(l2)))
250 | transformedParams_new.columns = pd.MultiIndex.from_tuples(colTups)
251 |
252 | predictedScores_new = pd.Series(reg.predict(scaler.transform(transformedParams_new.loc[:, transformedParamHeader.columns].fillna(0).values)), index=transformedParams_new.index)
253 | ```
254 |
255 |
256 | ```python
257 | predictedScores_new.to_csv(PREDICTED_SCORE_TABLE, sep='\t')
258 | ```
259 |
260 | # 3. Construct sgRNA libraries
261 | ## Score sgRNAs for off-target potential
262 |
263 |
264 | ```python
265 | #There are many ways to score sgRNAs for off-target potential; below is one method that is simple and flexible,
266 | #but it ignores gapped alignments and alternate PAMs, and relies on bowtie, which may not be maximally sensitive in all cases
267 | ```
268 |
269 |
270 | ```python
271 | #output all sequences to a temporary FASTQ file for running bowtie alignment
272 | def outputTempBowtieFastq(libraryTable, outputFileName):
273 | phredString = 'I4!=======44444+++++++' #weighting for how impactful mismatches are along sgRNA sequence
274 | with open(outputFileName,'w') as outfile:
275 | for name, row in libraryTable.iterrows():
276 | outfile.write('@' + name + '\n')
277 | outfile.write('CCN' + str(Seq.Seq(row['sequence'][1:]).reverse_complement()) + '\n')
278 | outfile.write('+\n')
279 | outfile.write(phredString + '\n')
280 |
281 | outputTempBowtieFastq(libraryTable_new, TEMP_FASTQ_FILE)
282 | ```
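
The fake quality string is what makes the `-e` thresholds in the next cell meaningful: in bowtie's Maq-like mode, `-e` caps the sum of the Phred quality values at all mismatched positions, so bases given a high "quality" are strongly penalized for mismatching while low-quality ones are tolerated. A quick decoding of the weights (standard Phred+33 arithmetic; note the `!` = 0 falls on the N of the CCN PAM, so any base is allowed there):

```python
#decode the per-position mismatch penalties encoded in the fake quality string;
#the first three values cover the CCN PAM written at the start of each read
print [ord(char) - 33 for char in 'I4!=======44444+++++++']
#-> [40, 19, 0, 28, 28, 28, 28, 28, 28, 28, 19, 19, 19, 19, 19, 10, 10, 10, 10, 10, 10, 10]
```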
283 |
284 |
285 | ```python
286 | import subprocess
287 | fqFile = TEMP_FASTQ_FILE
288 |
289 | #specifying a list of parameters to run bowtie with
290 | #each tuple contains
291 | # *the mismatch threshold below which a site is considered a potential off-target (higher is more stringent)
292 | # *the number of sites allowed (1 is minimum since each sgRNA should have one true site in genome)
293 | # *the genome index against which to align the sgRNA sequences; these can be custom built to only consider sites near TSSs
294 | # *a name for the bowtie run to create appropriately named output files
295 | alignmentList = [(39,1,'~/indices/hg19.ensemblTSSflank500b','39_nearTSS'),
296 | (31,1,'~/indices/hg19.ensemblTSSflank500b','31_nearTSS'),
297 | (21,1,'~/indices/hg19.maskChrMandPAR','21_genome'),
298 | (31,2,'~/indices/hg19.ensemblTSSflank500b','31_2_nearTSS'),
299 | (31,3,'~/indices/hg19.ensemblTSSflank500b','31_3_nearTSS')]
300 |
301 | alignmentColumns = []
302 | for btThreshold, mflag, bowtieIndex, runname in alignmentList:
303 |
304 | alignedFile = 'bowtie_output/' + runname + '_aligned.txt'
305 | unalignedFile = 'bowtie_output/' + runname + '_unaligned.fq'
306 | maxFile = 'bowtie_output/' + runname + '_max.fq'
307 |
308 | bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile
309 | print bowtieString
310 | print subprocess.call(bowtieString, shell=True)
311 |
312 | #parse through the file of sgRNAs that exceeded "m", the maximum allowable alignments, and mark "True" any that are found
313 | with open(maxFile) as infile:
314 | sgsAligning = set()
315 | for i, line in enumerate(infile):
316 | if i%4 == 0: #id line
317 | sgsAligning.add(line.strip()[1:])
318 |
319 | alignmentColumns.append(libraryTable_new.apply(lambda row: row.name in sgsAligning, axis=1))
320 |
321 | #collate results into a table, and flip the boolean values to yield the sgRNAs that passed filter as True
322 | alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3]).ne(True)
323 | ```
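
After the boolean flip, `True` means an sgRNA passed a given filter, so the stringency of the runs can be compared at a glance:

```python
#number of sgRNAs passing each off-target filter (columns are the run names above)
print alignmentTable.sum()
```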
324 |
325 | ## Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering
326 |
327 |
328 | ```python
329 | #combine all generated data into one master table
330 | predictedScores_new.name = 'predicted score'
331 | v2Table = pd.concat((libraryTable_new, predictedScores_new, alignmentTable, sgInfoTable_new), axis=1, keys=['library table v2', 'predicted score', 'off-target filters', 'sgRNA info'])
332 | ```
333 |
334 |
335 | ```python
336 | import re
337 | #for our pCRISPRi/a-v2 vector, we append flanking sequences to each sgRNA sequence for cloning and require the oligo to contain
338 | #exactly 1 BstXI and BlpI site each for cloning, and exactly 0 SbfI sites for sequencing sample preparation
339 | restrictionSites = {re.compile('CCA......TGG'):1,
340 | re.compile('GCT.AGC'):1,
341 | re.compile('CCTGCAGG'):0}
342 |
343 | def matchREsites(sequence, REdict):
344 | seq = sequence.upper()
345 |     for resite, numMatchesExpected in REdict.iteritems():
346 | if len(resite.findall(seq)) != numMatchesExpected:
347 | return False
348 |
349 | return True
350 |
351 | def checkOverlaps(leftPosition, acceptedLeftPositions, nonoverlapMin):
352 | for pos in acceptedLeftPositions:
353 | if abs(pos - leftPosition) < nonoverlapMin:
354 | return False
355 | return True
356 | ```
357 |
358 |
359 | ```python
360 | #flanking sequences
361 | upstreamConstant = 'CCACCTTGTTG'
362 | downstreamConstant = 'GTTTAAGAGCTAAGCTG'
363 |
364 | #minimum spacing (bp) between the left-aligned positions of any two accepted sgRNAs targeting the same TSS
365 | nonoverlapMin = 3
366 |
367 | #number of sgRNAs to pick per gene/TSS
368 | sgRNAsToPick = 10
369 |
370 | #list of off-target filter (or combinations of filters) levels, matching the names in the alignment table above
371 | offTargetLevels = [['31_nearTSS', '21_genome'],
372 | ['31_nearTSS'],
373 | ['21_genome'],
374 | ['31_2_nearTSS'],
375 | ['31_3_nearTSS']]
376 |
377 | #for each gene/TSS, go through each sgRNA in descending order of predicted score
378 | #if an sgRNA passes the restriction site, overlap, and off-target filters, accept it into the library
379 | #if the number of sgRNAs accepted is less than sgRNAsToPick, reduce off-target stringency by one and continue
380 | v2Groups = v2Table.groupby([('library table v2','gene'),('library table v2','transcripts')])
381 | newSgIds = []
382 | unfinishedTss = []
383 | for (gene, transcript), group in v2Groups:
384 | geneSgIds = []
385 | geneLeftPositions = []
386 | empiricalSgIds = dict()
387 |
388 | stringency = 0
389 |
390 | while len(geneSgIds) < sgRNAsToPick and stringency < len(offTargetLevels):
391 | for sgId_v2, row in group.sort(('predicted score','predicted score'), ascending=False).iterrows():
392 | oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant
393 | leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)
394 | if len(geneSgIds) < sgRNAsToPick and row['off-target filters'].loc[offTargetLevels[stringency]].all() \
395 | and matchREsites(oligoSeq, restrictionSites) \
396 | and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):
397 | geneSgIds.append((sgId_v2,
398 | gene,transcript,
399 | row[('library table v2','sequence')], oligoSeq,
400 | row[('predicted score','predicted score')], np.nan,
401 | stringency))
402 | geneLeftPositions.append(leftPos)
403 |
404 | stringency += 1
405 |
406 | if len(geneSgIds) < sgRNAsToPick:
407 | unfinishedTss.append((gene, transcript)) #if the number of accepted sgRNAs is still less than sgRNAsToPick, discard gene
408 | else:
409 | newSgIds.extend(geneSgIds)
410 |
411 | libraryTable_complete = pd.DataFrame(newSgIds, columns = ['sgID', 'gene', 'transcript','protospacer sequence', 'oligo sequence',
412 | 'predicted score', 'empirical score', 'off-target stringency']).set_index('sgID')
413 | ```
414 |
415 |
416 | ```python
417 | #number of sgRNAs accepted at each stringency level
418 | libraryTable_complete.groupby('off-target stringency').agg(len).iloc[:,0]
419 | ```
420 |
421 |
422 | ```python
423 | #number of TSSs with fewer than required number of sgRNAs (and thus not included in the library)
424 | print len(unfinishedTss)
425 | ```
426 |
427 |
428 | ```python
429 | #Note that empirical information from previous screens can be included as well--for example:
430 | geneToDisc = maxDiscriminantTable['best score'].groupby(level=0).agg(max).to_dict()
431 | thresh = 7
432 | empiricalBonus = .2
433 |
434 | upstreamConstant = 'CCACCTTGTTG'
435 | downstreamConstant = 'GTTTAAGAGCTAAGCTG'
436 |
437 | nonoverlapMin = 3
438 |
439 | sgRNAsToPick = 10
440 |
441 | offTargetLevels = [['31_nearTSS', '21_genome'],
442 | ['31_nearTSS'],
443 | ['21_genome'],
444 | ['31_2_nearTSS'],
445 | ['31_3_nearTSS']]
446 | offTargetLevels_v1 = [[s + '_v1' for s in l] for l in offTargetLevels]
447 |
448 | v1Groups = v1Table.groupby([('relative position','gene'),('relative position','transcript')])
449 | v2Groups = v2Table.groupby([('library table v2','gene'),('library table v2','transcripts')])
450 |
451 | newSgIds = []
452 | unfinishedTss = []
453 | for (gene, transcript), group in v2Groups:
454 | geneSgIds = []
455 | geneLeftPositions = []
456 | empiricalSgIds = dict()
457 |
458 | stringency = 0
459 |
460 | while len(geneSgIds) < sgRNAsToPick and stringency < len(offTargetLevels):
461 |
462 | if gene in geneToDisc and geneToDisc[gene] >= thresh and (gene, transcript) in v1Groups.groups:
463 |
464 | for sgId_v1, row in v1Groups.get_group((gene, transcript)).sort(('Empirical activity score','Empirical activity score'),ascending=False).iterrows():
465 | oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant
466 | leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)
467 | if len(geneSgIds) < sgRNAsToPick and min(abs(row.loc['relative position'].iloc[2:])) < 5000 \
468 | and row[('Empirical activity score','Empirical activity score')] >= .75 \
469 | and row['off-target filters'].loc[offTargetLevels_v1[stringency]].all() \
470 | and matchREsites(oligoSeq, restrictionSites) \
471 | and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):
472 | if len(geneSgIds) < 2:
473 | geneSgIds.append((row[('library table v2','sgId_v2')],
474 | gene,transcript,
475 | row[('library table v2','sequence')], oligoSeq,
476 | np.nan,row[('Empirical activity score','Empirical activity score')],
477 | stringency))
478 | geneLeftPositions.append(leftPos)
479 |
480 | empiricalSgIds[row[('library table v2','sgId_v2')]] = row[('Empirical activity score','Empirical activity score')]
481 |
482 | adjustedScores = group.apply(lambda row: row[('predicted score','CRISPRiv2 predicted score')] + empiricalBonus if row.name in empiricalSgIds else row[('predicted score','CRISPRiv2 predicted score')], axis=1)
483 | adjustedScores.name = ('adjusted score','')
484 | for sgId_v2, row in pd.concat((group,adjustedScores),axis=1).sort(('adjusted score',''), ascending=False).iterrows():
485 | oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant
486 | leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)
487 | if len(geneSgIds) < sgRNAsToPick and row['off-target filters'].loc[offTargetLevels[stringency]].all() \
488 | and matchREsites(oligoSeq, restrictionSites) \
489 | and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):
490 | geneSgIds.append((sgId_v2,
491 | gene,transcript,
492 | row[('library table v2','sequence')], oligoSeq,
493 | row[('predicted score','CRISPRiv2 predicted score')], empiricalSgIds[sgId_v2] if sgId_v2 in empiricalSgIds else np.nan,
494 | stringency))
495 | geneLeftPositions.append(leftPos)
496 |
497 | stringency += 1
498 |
499 | if len(geneSgIds) < sgRNAsToPick:
500 | unfinishedTss.append((gene, transcript))
501 | else:
502 | newSgIds.extend(geneSgIds)
503 |
504 |
505 | libraryTable_complete = pd.DataFrame(newSgIds, columns = ['sgID', 'gene', 'transcript','protospacer sequence', 'oligo sequence',
506 | 'predicted score', 'empirical score', 'off-target stringency']).set_index('sgID')
507 | ```
508 |
509 | ## Design negative controls matching the base composition of the library
510 |
511 |
512 | ```python
513 | #calculate the base frequency at each position of the sgRNA, then generate random sequences weighted by those frequencies
514 | def getBaseFrequencies(libraryTable, baseConversion = {'G':0, 'C':1, 'T':2, 'A':3}):
515 | baseArray = np.zeros((len(libraryTable),20))
516 |
517 | for i, (index, seq) in enumerate(libraryTable['protospacer sequence'].iteritems()):
518 | for j, char in enumerate(seq.upper()):
519 | baseArray[i,j] = baseConversion[char]
520 |
521 | baseTable = pd.DataFrame(baseArray, index = libraryTable.index)
522 |
523 | baseFrequencies = baseTable.apply(lambda col: col.groupby(col).agg(len)).fillna(0) / len(baseTable)
524 | baseFrequencies.index = ['G','C','T','A']
525 |
526 | baseCumulativeFrequencies = baseFrequencies.copy()
527 | baseCumulativeFrequencies.loc['C'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C']
528 | baseCumulativeFrequencies.loc['T'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T']
529 | baseCumulativeFrequencies.loc['A'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T'] + baseFrequencies.loc['A']
530 |
531 | return baseFrequencies, baseCumulativeFrequencies
532 |
533 | def generateRandomSequence(baseCumulativeFrequencies):
534 | randArray = np.random.random(baseCumulativeFrequencies.shape[1])
535 |
536 | seq = []
537 | for i, col in baseCumulativeFrequencies.iteritems():
538 | for base, freq in col.iteritems():
539 | if randArray[i] < freq:
540 | seq.append(base)
541 | break
542 |
543 | return ''.join(seq)
544 | ```
545 |
546 |
547 | ```python
548 | baseCumulativeFrequencies = getBaseFrequencies(libraryTable_complete)[1]
549 | negList = []
550 | for i in range(30000):
551 | negList.append(generateRandomSequence(baseCumulativeFrequencies))
552 | negTable = pd.DataFrame(negList, index=['non-targeting_' + str(i) for i in range(30000)], columns = ['sequence'])
553 |
554 | outputTempBowtieFastq(negTable, TEMP_FASTQ_FILE)
555 | ```
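
As a sanity check, the random sequences should roughly reproduce the library's positional base frequencies; `getBaseFrequencies` can be reused after renaming the sequence column to the name it expects:

```python
#compare positional base frequencies of the designed negatives to the library
#(showing the first five protospacer positions)
negFrequencies = getBaseFrequencies(negTable.rename(columns={'sequence':'protospacer sequence'}))[0]
print negFrequencies.iloc[:, :5]
```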
556 |
557 |
558 | ```python
559 | #similar to targeting sgRNA off-target scoring, but looking for sgRNAs with 0 alignments
560 | fqFile = TEMP_FASTQ_FILE
561 |
562 | alignmentList = [(31,1,'~/indices/hg19.ensemblTSSflank500b','31_nearTSS_negs'),
563 | (21,1,'~/indices/hg19.maskChrMandPAR','21_genome_negs')]
564 |
565 | alignmentColumns = []
566 | for btThreshold, mflag, bowtieIndex, runname in alignmentList:
567 |
568 | alignedFile = 'bowtie_output/' + runname + '_aligned.txt'
569 |     unalignedFile = 'bowtie_output/' + runname + '_unaligned.fq'
570 | maxFile = 'bowtie_output/' + runname + '_max.fq'
571 |
572 | bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile
573 | print bowtieString
574 | print subprocess.call(bowtieString, shell=True)
575 |
576 | #read unaligned file for negs, and then don't flip boolean of alignmentTable
577 | with open(unalignedFile) as infile:
578 | sgsAligning = set()
579 | for i, line in enumerate(infile):
580 | if i%4 == 0: #id line
581 | sgsAligning.add(line.strip()[1:])
582 |
583 | alignmentColumns.append(negTable.apply(lambda row: row.name in sgsAligning, axis=1))
584 |
585 | alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3])
586 | alignmentTable.head()
587 | ```
588 |
589 |
590 | ```python
591 | acceptedNegList = []
592 | negCount = 0
593 | for i, (name, row) in enumerate(pd.concat((negTable,alignmentTable),axis=1, keys=['seq','alignment']).iterrows()):
594 | oligo = upstreamConstant + row['seq','sequence'] + downstreamConstant
595 | if row['alignment'].all() and matchREsites(oligo, restrictionSites):
596 | acceptedNegList.append(('non-targeting_%05d' % negCount, 'negative_control', 'na', row['seq','sequence'], oligo, 0))
597 | negCount += 1
598 |
599 | acceptedNegs = pd.DataFrame(acceptedNegList, columns = ['sgId', 'gene', 'transcript', 'protospacer sequence', 'oligo sequence', 'off-target stringency']).set_index('sgId')
600 | ```
601 |
602 | ## Finalizing library design
603 |
604 | * divide genes into sublibrary groups (if required)
605 | * assign negative control sgRNAs to sublibrary groups; ~1-2% of the number of sgRNAs in the library is a good rule of thumb
606 | * append PCR adapter sequences (~18bp) to each end of the oligo sequences to enable amplification of the oligo pool; each sublibrary should have an orthogonal sequence so the sublibraries can be cloned separately (a minimal sketch follows below)
607 |
608 |
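A minimal sketch of the adapter step under stated assumptions: the adapter sequences and the single-sublibrary assignment below are hypothetical placeholders, not the published pCRISPRi/a-v2 primer sequences.

```python
from Bio import Seq

#hypothetical orthogonal 18bp PCR adapter pairs, one per sublibrary
sublibraryAdapters = {'sublibrary_A': ('ACGTACGTACGTACGTAC', 'TGCATGCATGCATGCATG')}

#combine targeting sgRNAs and accepted negative controls, then assign sublibraries
#(here, hypothetically, everything goes into a single sublibrary)
fullTable = pd.concat([libraryTable_complete, acceptedNegs])
fullTable['sublibrary'] = 'sublibrary_A'

def appendAdapters(row):
    #the 3' adapter is appended as the reverse complement of the reverse PCR primer
    fwdPrimer, revPrimer = sublibraryAdapters[row['sublibrary']]
    return fwdPrimer + row['oligo sequence'] + str(Seq.Seq(revPrimer).reverse_complement())

fullTable['final oligo'] = fullTable.apply(appendAdapters, axis=1)
```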
612 |
--------------------------------------------------------------------------------
/CRISPRiaDesign_example_notebook.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "1. Learning sgRNA predictors from empirical data\n",
8 | " * Load scripts and empirical data\n",
9 | " * Generate TSS annotation using FANTOM dataset\n",
10 | " * Calculate parameters for empirical sgRNAs\n",
11 | " * Fit parameters\n",
12 | "2. Applying machine learning model to predict sgRNA activity\n",
13 | " * Find all sgRNAs in genomic regions of interest \n",
14 | " * Predicting sgRNA activity\n",
15 | "3. Construct sgRNA libraries\n",
16 | " * Score sgRNAs for off-target potential\n",
17 | "    * Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering\n",
18 | "    * Design negative controls matching the base composition of the library\n",
19 | "    * Finalizing library design"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "# 1. Learning sgRNA predictors from empirical data\n",
27 | "## Load scripts and empirical data"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": null,
33 | "metadata": {
34 | "collapsed": true
35 | },
36 | "outputs": [],
37 | "source": [
38 | "%run sgRNA_learning.py"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": null,
44 | "metadata": {
45 | "collapsed": true
46 | },
47 | "outputs": [],
48 | "source": [
49 | "genomeDict = loadGenomeAsDict(FASTA_FILE_OF_GENOME)\n",
50 | "gencodeData = loadGencodeData(GTF_FILE_FROM_GENCODE)"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": null,
56 | "metadata": {
57 | "collapsed": true
58 | },
59 | "outputs": [],
60 | "source": [
61 | "#load empirical data as tables in the format generated by github.com/mhorlbeck/ScreenProcessing\n",
62 | "libraryTable, phenotypeTable, geneTable = loadExperimentData(PATHS_TO_DATA_GENERATED_BY_ScreenProcessing)"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": null,
68 | "metadata": {
69 | "collapsed": true
70 | },
71 | "outputs": [],
72 | "source": [
73 | "#extract genes that scored as hits, normalize phenotypes, and extract information on sgRNAs from the sgIDs\n",
74 | "discriminantTable = calculateDiscriminantScores(geneTable)\n",
75 | "normedScores, maxDiscriminantTable = getNormalizedsgRNAsOverThresh(libraryTable, phenotypeTable, discriminantTable, \n",
76 | "                                                                    DISCRIMINANT_THRESHOLD_eg20,\n",
77 | " 3, transcripts=False)\n",
78 | "\n",
79 | "libraryTable_subset = libraryTable.loc[normedScores.dropna().index]\n",
80 | "sgInfoTable = parseAllSgIds(libraryTable_subset)"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "## Generate TSS annotation using FANTOM dataset"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {
94 | "collapsed": true
95 | },
96 | "outputs": [],
97 | "source": [
98 | "#first generates a table of TSS annotations\n",
99 | "#legacy function to make an intermediate table for the \"P1P2\" annotation strategy, will be replaced in future versions\n",
100 | "#TSS_TABLE_BASED_ON_ENSEMBL is table without headers with columns:\n",
101 | "#gene, transcript, chromosome, TSS coordinate, strand, annotation_source(optional)\n",
102 | "tssTable = generateTssTable(geneTable, TSS_TABLE_BASED_ON_ENSEMBL, FANTOM_TSS_ANNOTATION_BED, 200)"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": null,
108 | "metadata": {
109 | "collapsed": true
110 | },
111 | "outputs": [],
112 | "source": [
113 | "#Now create a TSS annotation by searching for P1 and P2 peaks near annotated TSSs\n",
114 | "geneToAliases = generateAliasDict(HGNC_SYMBOL_LOOKUP_TABLE,gencodeData)\n",
115 | "p1p2Table = generateTssTable_P1P2strategy(tssTable.loc[tssTable.apply(lambda row: row.name[0][:6] != 'pseudo',axis=1)],\n",
116 | " FANTOM_TSS_ANNOTATION_BED, \n",
117 | " matchedp1p2Window = 30000, #region around supplied TSS annotation to search for a FANTOM P1 or P2 peak that matches the gene name (or alias)\n",
118 | " anyp1p2Window = 500, #region around supplied TSS annotation to search for the nearest P1 or P2 peak\n",
119 | " anyPeakWindow = 200, #region around supplied TSS annotation to search for any CAGE peak\n",
120 | " minDistanceForTwoTSS = 1000, #If a P1 and P2 peak are found, maximum distance at which to combine into a single annotation (with primary/secondary TSS positions)\n",
121 | " aliasDict = geneToAliases[0])\n",
122 | "#the function will report some collisions of IDs due to use of aliases and redundancy in genome, but will resolve these itself"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "metadata": {
129 | "collapsed": true
130 | },
131 | "outputs": [],
132 | "source": [
133 | "#after saving the tables for downstream use (e.g. with to_csv), reload them:\n",
134 | "tssTable = pd.read_csv(TSS_TABLE_PATH,sep='\\t', index_col=range(2))\n",
135 | "p1p2Table = pd.read_csv(P1P2_TABLE_PATH,sep='\\t', header=0, index_col=range(2))"
136 | ]
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "metadata": {
141 | "collapsed": true
142 | },
143 | "source": [
144 | "## Calculate parameters for empirical sgRNAs"
145 | ]
146 | },
147 | {
148 | "cell_type": "code",
149 | "execution_count": null,
150 | "metadata": {
151 | "collapsed": true
152 | },
153 | "outputs": [],
154 | "source": [
155 | "#Load bigwig files for any chromatin data of interest\n",
156 | "bwhandleDict = {'dnase':BigWigFile(open('ENCODE_data/wgEncodeOpenChromDnaseK562BaseOverlapSignalV2.bigWig')),\n",
157 | "'faire':BigWigFile(open('ENCODE_data/wgEncodeOpenChromFaireK562Sig.bigWig')),\n",
158 | "'mnase':BigWigFile(open('ENCODE_data/wgEncodeSydhNsomeK562Sig.bigWig'))}"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": null,
164 | "metadata": {
165 | "collapsed": true
166 | },
167 | "outputs": [],
168 | "source": [
169 | "paramTable_trainingGuides = generateTypicalParamTable(libraryTable_subset,sgInfoTable, tssTable, p1p2Table, genomeDict, bwhandleDict)"
170 | ]
171 | },
172 | {
173 | "cell_type": "markdown",
174 | "metadata": {},
175 | "source": [
176 | "## Fit parameters"
177 | ]
178 | },
179 | {
180 | "cell_type": "code",
181 | "execution_count": null,
182 | "metadata": {
183 | "collapsed": true
184 | },
185 | "outputs": [],
186 | "source": [
187 | "#populate table of fitting parameters\n",
188 | "typeList = ['binnable_onehot', \n",
189 | " 'continuous', 'continuous', 'continuous', 'continuous',\n",
190 | " 'binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot',\n",
191 | " 'binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot','binnable_onehot',\n",
192 | " 'binary']\n",
193 | "typeList.extend(['binary']*160)\n",
194 | "typeList.extend(['binary']*(16*38))\n",
195 | "typeList.extend(['binnable_onehot']*3)\n",
196 | "typeList.extend(['binnable_onehot']*2)\n",
197 | "typeList.extend(['binary']*18)\n",
198 | "fitTable = pd.DataFrame(typeList, index=paramTable_trainingGuides.columns, columns=['type'])\n",
199 | "fitparams =[{'bin width':1, 'min edge data':50, 'bin function':np.median},\n",
200 | " {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},\n",
201 | " {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},\n",
202 | " {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},\n",
203 | " {'C':[.01,.05, .1,.5], 'gamma':[.000001, .00005,.0001,.0005]},\n",
204 | " {'bin width':1, 'min edge data':50, 'bin function':np.median},\n",
205 | " {'bin width':1, 'min edge data':50, 'bin function':np.median},\n",
206 | " {'bin width':1, 'min edge data':50, 'bin function':np.median},\n",
207 | " {'bin width':1, 'min edge data':50, 'bin function':np.median},\n",
208 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n",
209 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n",
210 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n",
211 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n",
212 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n",
213 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},\n",
214 | " {'bin width':.1, 'min edge data':50, 'bin function':np.median},dict()]\n",
215 | "fitparams.extend([dict()]*160)\n",
216 | "fitparams.extend([dict()]*(16*38))\n",
217 | "fitparams.extend([\n",
218 | " {'bin width':.15, 'min edge data':50, 'bin function':np.median},\n",
219 | " {'bin width':.15, 'min edge data':50, 'bin function':np.median},\n",
220 | " {'bin width':.15, 'min edge data':50, 'bin function':np.median}])\n",
221 | "fitparams.extend([\n",
222 | " {'bin width':2, 'min edge data':50, 'bin function':np.median},\n",
223 | " {'bin width':2, 'min edge data':50, 'bin function':np.median}])\n",
224 | "fitparams.extend([dict()]*18)\n",
225 | "fitTable['params'] = fitparams"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "metadata": {
232 | "collapsed": true
233 | },
234 | "outputs": [],
235 | "source": [
236 | "#divide empirical data into n-folds for cross-validation\n",
237 | "geneFoldList = getGeneFolds(libraryTable_subset, 5, transcripts=False)"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": null,
243 | "metadata": {
244 | "collapsed": true
245 | },
246 | "outputs": [],
247 | "source": [
248 | "#for each fold, fit parameters to training folds and measure ROC on test fold\n",
249 | "coefs = []\n",
250 | "scoreTups = []\n",
251 | "transformedParamTups = []\n",
252 | "\n",
253 | "for geneFold_train, geneFold_test in geneFoldList:\n",
254 | "\n",
255 | " transformedParams_train, estimators = fitParams(paramTable_trainingGuides.loc[normedScores.dropna().index].iloc[geneFold_train], normedScores.loc[normedScores.dropna().index].iloc[geneFold_train], fitTable)\n",
256 | "\n",
257 | " transformedParams_test = transformParams(paramTable_trainingGuides.loc[normedScores.dropna().index].iloc[geneFold_test], fitTable, estimators)\n",
258 | " \n",
259 | " reg = linear_model.ElasticNetCV(l1_ratio=[.5, .75, .9, .99,1], n_jobs=16, max_iter=2000)\n",
260 | " \n",
261 | " scaler = preprocessing.StandardScaler()\n",
262 | " reg.fit(scaler.fit_transform(transformedParams_train), normedScores.loc[normedScores.dropna().index].iloc[geneFold_train])\n",
263 | " predictedScores = pd.Series(reg.predict(scaler.transform(transformedParams_test)), index=transformedParams_test.index)\n",
264 | " testScores = normedScores.loc[normedScores.dropna().index].iloc[geneFold_test]\n",
265 | " \n",
266 | " transformedParamTups.append((scaler.transform(transformedParams_train),scaler.transform(transformedParams_test)))\n",
267 | " scoreTups.append((testScores, predictedScores))\n",
268 | " \n",
269 | " print 'Prediction AUC-ROC:', metrics.roc_auc_score((testScores >= .75).values, np.array(predictedScores.values,dtype='float64'))\n",
270 | " print 'Prediction R^2:', reg.score(scaler.transform(transformedParams_test), testScores)\n",
271 | " print 'Regression parameters:', reg.l1_ratio_, reg.alpha_\n",
272 | " coefs.append(pd.DataFrame(zip(*[abs(reg.coef_),reg.coef_]), index = transformedParams_test.columns, columns=['abs','true']))\n",
273 | " print 'Number of features used:', len(coefs[-1]) - sum(coefs[-1]['abs'] < .00000000001)"
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": null,
279 | "metadata": {
280 | "collapsed": true
281 | },
282 | "outputs": [],
283 | "source": [
284 | "#can select an arbitrary fold (as shown here simply the last one tested) to save state for reproducing estimators later\n",
285 | "#the pickling of the scikit-learn estimators/regressors will allow the model to be reloaded for prediction of other guide designs, \n",
286 | "# but will not be compatible across scikit-learn versions, so it is important to preserve the training data and training/test folds\n",
287 | "import cPickle\n",
288 | "estimatorString = cPickle.dumps((fitTable, estimators, scaler, reg, (geneFold_train, geneFold_test)))\n",
289 | "with open(PICKLE_FILE,'w') as outfile:\n",
290 | " outfile.write(estimatorString)\n",
291 | " \n",
292 | "#also save the transformed parameters as these can slightly differ based on the automated binning strategy\n",
293 | "transformedParams_train.head().to_csv(TRANSFORMED_PARAM_HEADER,sep='\\t')"
294 | ]
295 | },
296 | {
297 | "cell_type": "markdown",
298 | "metadata": {
299 | "collapsed": true
300 | },
301 | "source": [
302 | "# 2. Applying machine learning model to predict sgRNA activity"
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "metadata": {
309 | "collapsed": true
310 | },
311 | "outputs": [],
312 | "source": [
313 | "#starting from a new session for demonstration purposes:\n",
314 | "%run sgRNA_learning.py\n",
315 | "import cPickle\n",
316 | "\n",
317 | "#load tssTable, p1p2Table, genome sequence, chromatin data\n",
318 | "tssTable = pd.read_csv(TSS_TABLE_PATH,sep='\\t', index_col=range(2))\n",
319 | "\n",
320 | "p1p2Table = pd.read_csv(P1P2_TABLE_PATH,sep='\\t', header=0, index_col=range(2))\n",
321 | "p1p2Table['primary TSS'] = p1p2Table['primary TSS'].apply(lambda s: tuple(int(x) for x in s.strip('()').split(', ')))\n",
322 | "p1p2Table['secondary TSS'] = p1p2Table['secondary TSS'].apply(lambda s: tuple(int(x) for x in s.strip('()').split(', ')))\n",
323 | "\n",
324 | "genomeDict = loadGenomeAsDict(FASTA_FILE_OF_GENOME)\n",
325 | "\n",
326 | "bwhandleDict = {'dnase':BigWigFile(open('ENCODE_data/wgEncodeOpenChromDnaseK562BaseOverlapSignalV2.bigWig')),\n",
327 | "'faire':BigWigFile(open('ENCODE_data/wgEncodeOpenChromFaireK562Sig.bigWig')),\n",
328 | "'mnase':BigWigFile(open('ENCODE_data/wgEncodeSydhNsomeK562Sig.bigWig'))}\n",
329 | "\n",
330 | "#load sgRNA prediction model saved after the parameter fitting step\n",
331 | "with open(PICKLE_FILE) as infile:\n",
332 | " fitTable, estimators, scaler, reg, (geneFold_train, geneFold_test) = cPickle.load(infile)\n",
333 | " \n",
334 | "transformedParamHeader = pd.read_csv(TRANSFORMED_PARAM_HEADER,sep='\\t')"
335 | ]
336 | },
337 | {
338 | "cell_type": "markdown",
339 | "metadata": {},
340 | "source": [
341 | "## Find all sgRNAs in genomic regions of interest "
342 | ]
343 | },
344 | {
345 | "cell_type": "code",
346 | "execution_count": null,
347 | "metadata": {
348 | "collapsed": true
349 | },
350 | "outputs": [],
351 | "source": [
352 | "#use the same p1p2Table as above or generate a new one for novel TSSs; the (-25,500) tuple sets the window (bp relative to each TSS) searched for sgRNAs\n",
353 | "libraryTable_new, sgInfoTable_new = findAllGuides(p1p2Table, genomeDict, (-25,500))"
354 | ]
355 | },
356 | {
357 | "cell_type": "code",
358 | "execution_count": null,
359 | "metadata": {
360 | "collapsed": true
361 | },
362 | "outputs": [],
363 | "source": [
364 | "#alternately, load tables of sgRNAs to score:\n",
365 | "libraryTable_new = pd.read_csv(LIBRARY_TABLE_PATH,sep='\\t',index_col=0)\n",
366 | "sgInfoTable_new = pd.read_csv(SGINFO_TABLE_PATH,sep='\\t',index_col=0)"
367 | ]
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "metadata": {},
372 | "source": [
373 | "## Predicting sgRNA activity"
374 | ]
375 | },
376 | {
377 | "cell_type": "code",
378 | "execution_count": null,
379 | "metadata": {
380 | "collapsed": true
381 | },
382 | "outputs": [],
383 | "source": [
384 | "#calculate parameters for new sgRNAs\n",
385 | "paramTable_new = generateTypicalParamTable(libraryTable_new, sgInfoTable_new, tssTable, p1p2Table, genomeDict, bwhandleDict)"
386 | ]
387 | },
388 | {
389 | "cell_type": "code",
390 | "execution_count": null,
391 | "metadata": {
392 | "collapsed": true
393 | },
394 | "outputs": [],
395 | "source": [
396 | "#transform and predict scores according to sgRNA prediction model\n",
397 | "transformedParams_new = transformParams(paramTable_new, fitTable, estimators)\n",
398 | "\n",
399 | "#reconcile any differences in column headers generated by automated binning\n",
400 | "colTups = []\n",
401 | "for (l1, l2), col in transformedParams_new.iteritems():\n",
402 | " colTups.append((l1,str(l2)))\n",
403 | "transformedParams_new.columns = pd.MultiIndex.from_tuples(colTups)\n",
404 | "\n",
405 | "predictedScores_new = pd.Series(reg.predict(scaler.transform(transformedParams_new.loc[:, transformedParamHeader.columns].fillna(0).values)), index=transformedParams_new.index)"
406 | ]
407 | },
408 | {
409 | "cell_type": "code",
410 | "execution_count": null,
411 | "metadata": {
412 | "collapsed": true
413 | },
414 | "outputs": [],
415 | "source": [
416 | "predictedScores_new.to_csv(PREDICTED_SCORE_TABLE, sep='\\t')"
417 | ]
418 | },
419 | {
420 | "cell_type": "markdown",
421 | "metadata": {},
422 | "source": [
423 | "# 3. Construct sgRNA libraries\n",
424 | "## Score sgRNAs for off-target potential"
425 | ]
426 | },
427 | {
428 | "cell_type": "code",
429 | "execution_count": null,
430 | "metadata": {
431 | "collapsed": true
432 | },
433 | "outputs": [],
434 | "source": [
435 | "#There are many ways to score sgRNAs for off-target potential; below is one method that is simple and flexible,\n",
436 | "#but it ignores gapped alignments and alternate PAMs, and relies on bowtie, which may not be maximally sensitive in all cases"
437 | ]
438 | },
439 | {
440 | "cell_type": "code",
441 | "execution_count": null,
442 | "metadata": {
443 | "collapsed": true
444 | },
445 | "outputs": [],
446 | "source": [
447 | "#output all sequences to a temporary FASTQ file for running bowtie alignment\n",
448 | "def outputTempBowtieFastq(libraryTable, outputFileName):\n",
449 | " phredString = 'I4!=======44444+++++++' #weighting for how impactful mismatches are along sgRNA sequence \n",
450 | " with open(outputFileName,'w') as outfile:\n",
451 | " for name, row in libraryTable.iterrows():\n",
452 | " outfile.write('@' + name + '\\n')\n",
453 | " outfile.write('CCN' + str(Seq.Seq(row['sequence'][1:]).reverse_complement()) + '\\n')\n",
454 | " outfile.write('+\\n')\n",
455 | " outfile.write(phredString + '\\n')\n",
456 | " \n",
457 | "outputTempBowtieFastq(libraryTable_new, TEMP_FASTQ_FILE)"
458 | ]
459 | },
460 | {
461 | "cell_type": "code",
462 | "execution_count": null,
463 | "metadata": {
464 | "collapsed": true
465 | },
466 | "outputs": [],
467 | "source": [
468 | "import subprocess\n",
469 | "fqFile = TEMP_FASTQ_FILE\n",
470 | "\n",
471 | "#specifying a list of parameters to run bowtie with\n",
472 | "#each tuple contains\n",
473 | "# *the mismatch threshold below which a site is considered a potential off-target (higher is more stringent)\n",
474 | "# *the number of sites allowed (1 is minimum since each sgRNA should have one true site in genome)\n",
475 | "# *the genome index against which to align the sgRNA sequences; these can be custom built to only consider sites near TSSs\n",
476 | "# *a name for the bowtie run to create appropriately named output files\n",
477 | "alignmentList = [(39,1,'~/indices/hg19.ensemblTSSflank500b','39_nearTSS'),\n",
478 | " (31,1,'~/indices/hg19.ensemblTSSflank500b','31_nearTSS'),\n",
479 | " (21,1,'~/indices/hg19.maskChrMandPAR','21_genome'),\n",
480 | " (31,2,'~/indices/hg19.ensemblTSSflank500b','31_2_nearTSS'),\n",
481 | " (31,3,'~/indices/hg19.ensemblTSSflank500b','31_3_nearTSS')]\n",
482 | "\n",
483 | "alignmentColumns = []\n",
484 | "for btThreshold, mflag, bowtieIndex, runname in alignmentList:\n",
485 | "\n",
486 | " alignedFile = 'bowtie_output/' + runname + '_aligned.txt'\n",
487 | " unalignedFile = 'bowtie_output/' + runname + '_unaligned.fq'\n",
488 | " maxFile = 'bowtie_output/' + runname + '_max.fq'\n",
489 | " \n",
490 | " bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile\n",
491 | " print bowtieString\n",
492 | " print subprocess.call(bowtieString, shell=True)\n",
493 | "\n",
494 | " #parse through the file of sgRNAs that exceeded \"m\", the maximum allowable alignments, and mark \"True\" any that are found\n",
495 | " with open(maxFile) as infile:\n",
496 | " sgsAligning = set()\n",
497 | " for i, line in enumerate(infile):\n",
498 | " if i%4 == 0: #id line\n",
499 | " sgsAligning.add(line.strip()[1:])\n",
500 | "\n",
501 | " alignmentColumns.append(libraryTable_new.apply(lambda row: row.name in sgsAligning, axis=1))\n",
502 | " \n",
503 | "#collate results into a table, and flip the boolean values to yield the sgRNAs that passed filter as True\n",
504 | "alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3]).ne(True)"
505 | ]
506 | },
507 | {
508 | "cell_type": "markdown",
509 | "metadata": {},
510 | "source": [
511 | "## Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering"
512 | ]
513 | },
514 | {
515 | "cell_type": "code",
516 | "execution_count": null,
517 | "metadata": {
518 | "collapsed": true
519 | },
520 | "outputs": [],
521 | "source": [
522 | "#combine all generated data into one master table\n",
523 | "predictedScores_new.name = 'predicted score'\n",
524 | "v2Table = pd.concat((libraryTable_new, predictedScores_new, alignmentTable, sgInfoTable_new), axis=1, keys=['library table v2', 'predicted score', 'off-target filters', 'sgRNA info'])"
525 | ]
526 | },
527 | {
528 | "cell_type": "code",
529 | "execution_count": null,
530 | "metadata": {
531 | "collapsed": true
532 | },
533 | "outputs": [],
534 | "source": [
535 | "import re\n",
536 | "#for our pCRISPRi/a-v2 vector, we append flanking sequences to each sgRNA sequence for cloning and require the oligo to contain\n",
537 | "#exactly 1 BstXI and BlpI site each for cloning, and exactly 0 SbfI sites for sequencing sample preparation\n",
538 | "restrictionSites = {re.compile('CCA......TGG'):1,\n",
539 | " re.compile('GCT.AGC'):1,\n",
540 | " re.compile('CCTGCAGG'):0}\n",
541 | "\n",
542 | "def matchREsites(sequence, REdict):\n",
543 | " seq = sequence.upper()\n",
544 | "    for resite, numMatchesExpected in REdict.iteritems():\n",
545 | " if len(resite.findall(seq)) != numMatchesExpected:\n",
546 | " return False\n",
547 | " \n",
548 | " return True\n",
549 | "\n",
550 | "def checkOverlaps(leftPosition, acceptedLeftPositions, nonoverlapMin):\n",
551 | " for pos in acceptedLeftPositions:\n",
552 | " if abs(pos - leftPosition) < nonoverlapMin:\n",
553 | " return False\n",
554 | " return True"
555 | ]
556 | },
557 | {
558 | "cell_type": "code",
559 | "execution_count": null,
560 | "metadata": {
561 | "collapsed": true
562 | },
563 | "outputs": [],
564 | "source": [
565 | "#flanking sequences\n",
566 | "upstreamConstant = 'CCACCTTGTTG'\n",
567 | "downstreamConstant = 'GTTTAAGAGCTAAGCTG'\n",
568 | "\n",
569 | "#minimum spacing (bp) between the left-aligned positions of any two accepted sgRNAs targeting the same TSS\n",
570 | "nonoverlapMin = 3\n",
571 | "\n",
572 | "#number of sgRNAs to pick per gene/TSS\n",
573 | "sgRNAsToPick = 10\n",
574 | "\n",
575 | "#list of off-target filter (or combinations of filters) levels, matching the names in the alignment table above\n",
576 | "offTargetLevels = [['31_nearTSS', '21_genome'],\n",
577 | " ['31_nearTSS'],\n",
578 | " ['21_genome'],\n",
579 | " ['31_2_nearTSS'],\n",
580 | " ['31_3_nearTSS']]\n",
581 | "\n",
582 | "#for each gene/TSS, go through each sgRNA in descending order of predicted score\n",
583 | "#if an sgRNA passes the restriction site, overlap, and off-target filters, accept it into the library\n",
584 | "#if the number of sgRNAs accepted is less than sgRNAsToPick, reduce off-target stringency by one and continue\n",
585 | "v2Groups = v2Table.groupby([('library table v2','gene'),('library table v2','transcripts')])\n",
586 | "newSgIds = []\n",
587 | "unfinishedTss = []\n",
588 | "for (gene, transcript), group in v2Groups:\n",
589 | " geneSgIds = []\n",
590 | " geneLeftPositions = []\n",
591 | " empiricalSgIds = dict()\n",
592 | " \n",
593 | " stringency = 0\n",
594 | " \n",
595 | " while len(geneSgIds) < sgRNAsToPick and stringency < len(offTargetLevels):\n",
596 | " for sgId_v2, row in group.sort(('predicted score','predicted score'), ascending=False).iterrows():\n",
597 | " oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant\n",
598 | " leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)\n",
599 | " if len(geneSgIds) < sgRNAsToPick and row['off-target filters'].loc[offTargetLevels[stringency]].all() \\\n",
600 | " and matchREsites(oligoSeq, restrictionSites) \\\n",
601 | " and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):\n",
602 | " geneSgIds.append((sgId_v2,\n",
603 | " gene,transcript,\n",
604 | " row[('library table v2','sequence')], oligoSeq,\n",
605 | " row[('predicted score','predicted score')], np.nan,\n",
606 | " stringency))\n",
607 | " geneLeftPositions.append(leftPos)\n",
608 | " \n",
609 | " stringency += 1\n",
610 | " \n",
611 | " if len(geneSgIds) < sgRNAsToPick:\n",
612 | " unfinishedTss.append((gene, transcript)) #if the number of accepted sgRNAs is still less than sgRNAsToPick, discard gene\n",
613 | " else:\n",
614 | " newSgIds.extend(geneSgIds)\n",
615 | " \n",
616 | "libraryTable_complete = pd.DataFrame(newSgIds, columns = ['sgID', 'gene', 'transcript','protospacer sequence', 'oligo sequence',\n",
617 | " 'predicted score', 'empirical score', 'off-target stringency']).set_index('sgID')"
618 | ]
619 | },
620 | {
621 | "cell_type": "code",
622 | "execution_count": null,
623 | "metadata": {
624 | "collapsed": true
625 | },
626 | "outputs": [],
627 | "source": [
628 | "#number of sgRNAs accepted at each stringency level\n",
629 | "newLibraryTable.groupby('off-target stringency').agg(len).iloc[:,0]"
630 | ]
631 | },
632 | {
633 | "cell_type": "code",
634 | "execution_count": null,
635 | "metadata": {
636 | "collapsed": true
637 | },
638 | "outputs": [],
639 | "source": [
640 | "#number of TSSs with fewer than required number of sgRNAs (and thus not included in the library)\n",
641 | "print len(unfinishedTss)"
642 | ]
643 | },
644 | {
645 | "cell_type": "code",
646 | "execution_count": null,
647 | "metadata": {
648 | "collapsed": true
649 | },
650 | "outputs": [],
651 | "source": [
652 | "#Note that empirical information from previous screens can be included as well--for example:\n",
653 | "geneToDisc = maxDiscriminantTable['best score'].groupby(level=0).agg(max).to_dict()\n",
654 | "thresh = 7\n",
655 | "empiricalBonus = .2\n",
656 | "\n",
657 | "upstreamConstant = 'CCACCTTGTTG'\n",
658 | "downstreamConstant = 'GTTTAAGAGCTAAGCTG'\n",
659 | "\n",
660 | "nonoverlapMin = 3\n",
661 | "\n",
662 | "sgRNAsToPick = 10\n",
663 | "\n",
664 | "offTargetLevels = [['31_nearTSS', '21_genome'],\n",
665 | " ['31_nearTSS'],\n",
666 | " ['21_genome'],\n",
667 | " ['31_2_nearTSS'],\n",
668 | " ['31_3_nearTSS']]\n",
669 | "offTargetLevels_v1 = [[s + '_v1' for s in l] for l in offTargetLevels]\n",
670 | "\n",
671 | "v1Groups = v1Table.groupby([('relative position','gene'),('relative position','transcript')])\n",
672 | "v2Groups = v2Table.groupby([('library table v2','gene'),('library table v2','transcripts')])\n",
673 | "\n",
674 | "newSgIds = []\n",
675 | "unfinishedTss = []\n",
676 | "for (gene, transcript), group in v2Groups:\n",
677 | " geneSgIds = []\n",
678 | " geneLeftPositions = []\n",
679 | " empiricalSgIds = dict()\n",
680 | " \n",
681 | " stringency = 0\n",
682 | " \n",
683 | " while len(geneSgIds) < sgRNAsToPick and stringency < len(offTargetLevels):\n",
684 | " \n",
685 | " if gene in geneToDisc and geneToDisc[gene] >= thresh and (gene, transcript) in v1Groups.groups:\n",
686 | "\n",
687 | " for sgId_v1, row in v1Groups.get_group((gene, transcript)).sort(('Empirical activity score','Empirical activity score'),ascending=False).iterrows():\n",
688 | " oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant\n",
689 | " leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)\n",
690 | " if len(geneSgIds) < sgRNAsToPick and min(abs(row.loc['relative position'].iloc[2:])) < 5000 \\\n",
691 | " and row[('Empirical activity score','Empirical activity score')] >= .75 \\\n",
692 | " and row['off-target filters'].loc[offTargetLevels_v1[stringency]].all() \\\n",
693 | " and matchREsites(oligoSeq, restrictionSites) \\\n",
694 | " and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):\n",
695 | " if len(geneSgIds) < 2:\n",
696 | " geneSgIds.append((row[('library table v2','sgId_v2')],\n",
697 | " gene,transcript,\n",
698 | " row[('library table v2','sequence')], oligoSeq,\n",
699 | " np.nan,row[('Empirical activity score','Empirical activity score')],\n",
700 | " stringency))\n",
701 | " geneLeftPositions.append(leftPos)\n",
702 | "\n",
703 | " empiricalSgIds[row[('library table v2','sgId_v2')]] = row[('Empirical activity score','Empirical activity score')]\n",
704 | "\n",
705 | " adjustedScores = group.apply(lambda row: row[('predicted score','CRISPRiv2 predicted score')] + empiricalBonus if row.name in empiricalSgIds else row[('predicted score','CRISPRiv2 predicted score')], axis=1)\n",
706 | " adjustedScores.name = ('adjusted score','')\n",
707 | " for sgId_v2, row in pd.concat((group,adjustedScores),axis=1).sort(('adjusted score',''), ascending=False).iterrows():\n",
708 | " oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant\n",
709 | " leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)\n",
710 | " if len(geneSgIds) < sgRNAsToPick and row['off-target filters'].loc[offTargetLevels[stringency]].all() \\\n",
711 | " and matchREsites(oligoSeq, restrictionSites) \\\n",
712 | " and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):\n",
713 | " geneSgIds.append((sgId_v2,\n",
714 | " gene,transcript,\n",
715 | " row[('library table v2','sequence')], oligoSeq,\n",
716 | " row[('predicted score','CRISPRiv2 predicted score')], empiricalSgIds[sgId_v2] if sgId_v2 in empiricalSgIds else np.nan,\n",
717 | " stringency))\n",
718 | " geneLeftPositions.append(leftPos)\n",
719 | " \n",
720 | " stringency += 1\n",
721 | " \n",
722 | " if len(geneSgIds) < sgRNAsToPick:\n",
723 | " unfinishedTss.append((gene, transcript))\n",
724 | " else:\n",
725 | " newSgIds.extend(geneSgIds)\n",
726 | "\n",
727 | " \n",
728 | "libraryTable_complete = pd.DataFrame(newSgIds, columns = ['sgID', 'gene', 'transcript','protospacer sequence', 'oligo sequence',\n",
729 | " 'predicted score', 'empirical score', 'off-target stringency']).set_index('sgID')"
730 | ]
731 | },
732 | {
733 | "cell_type": "markdown",
734 | "metadata": {},
735 | "source": [
736 | "## Design negative controls matching the base composition of the library"
737 | ]
738 | },
739 | {
740 | "cell_type": "code",
741 | "execution_count": null,
742 | "metadata": {
743 | "collapsed": true
744 | },
745 | "outputs": [],
746 | "source": [
747 | "#calcluate the base frequency at each position of the sgRNA, then generate random sequences weighted by this frequency\n",
748 | "def getBaseFrequencies(libraryTable, baseConversion = {'G':0, 'C':1, 'T':2, 'A':3}):\n",
749 | " baseArray = np.zeros((len(libraryTable),20))\n",
750 | "\n",
751 | " for i, (index, seq) in enumerate(libraryTable['protospacer sequence'].iteritems()):\n",
752 | " for j, char in enumerate(seq.upper()):\n",
753 | " baseArray[i,j] = baseConversion[char]\n",
754 | "\n",
755 | " baseTable = pd.DataFrame(baseArray, index = libraryTable.index)\n",
756 | " \n",
757 | " baseFrequencies = baseTable.apply(lambda col: col.groupby(col).agg(len)).fillna(0) / len(baseTable)\n",
758 | " baseFrequencies.index = ['G','C','T','A']\n",
759 | " \n",
760 | " baseCumulativeFrequencies = baseFrequencies.copy()\n",
761 | " baseCumulativeFrequencies.loc['C'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C']\n",
762 | " baseCumulativeFrequencies.loc['T'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T']\n",
763 | " baseCumulativeFrequencies.loc['A'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T'] + baseFrequencies.loc['A']\n",
764 | "\n",
765 | " return baseFrequencies, baseCumulativeFrequencies\n",
766 | "\n",
767 | "def generateRandomSequence(baseCumulativeFrequencies):\n",
768 | " randArray = np.random.random(baseCumulativeFrequencies.shape[1])\n",
769 | " \n",
770 | " seq = []\n",
771 | " for i, col in baseCumulativeFrequencies.iteritems():\n",
772 | " for base, freq in col.iteritems():\n",
773 | " if randArray[i] < freq:\n",
774 | " seq.append(base)\n",
775 | " break\n",
776 | " \n",
777 | " return ''.join(seq)"
778 | ]
779 | },
780 | {
781 | "cell_type": "code",
782 | "execution_count": null,
783 | "metadata": {
784 | "collapsed": true
785 | },
786 | "outputs": [],
787 | "source": [
788 | "baseCumulativeFrequencies = getBaseFrequencies(libraryTable_complete)[1]\n",
789 | "negList = []\n",
790 | "for i in range(30000):\n",
791 | " negList.append(generateRandomSequence(baseCumulativeFrequencies))\n",
792 | "negTable = pd.DataFrame(negList, index=['non-targeting_' + str(i) for i in range(30000)], columns = ['sequence'])\n",
793 | "\n",
794 | "outputTempBowtieFastq(negTable, TEMP_FASTQ_FILE)"
795 | ]
796 | },
797 | {
798 | "cell_type": "code",
799 | "execution_count": null,
800 | "metadata": {
801 | "collapsed": true
802 | },
803 | "outputs": [],
804 | "source": [
805 | "#similar to targeting sgRNA off-target scoring, but looking for sgRNAs with 0 alignments\n",
806 | "fqFile = TEMP_FASTQ_FILE\n",
807 | "\n",
808 | "alignmentList = [(31,1,'~/indices/hg19.ensemblTSSflank500b','31_nearTSS_negs'),\n",
809 | " (21,1,'~/indices/hg19.maskChrMandPAR','21_genome_negs')]\n",
810 | "\n",
811 | "alignmentColumns = []\n",
812 | "for btThreshold, mflag, bowtieIndex, runname in alignmentList:\n",
813 | "\n",
814 | " alignedFile = 'bowtie_output/' + runname + '_aligned.txt'\n",
815 | " unalignedFile = 'bowtie_output//' + runname + '_unaligned.fq'\n",
816 | " maxFile = 'bowtie_output/' + runname + '_max.fq'\n",
817 | " \n",
818 | " bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile\n",
819 | " print bowtieString\n",
820 | " print subprocess.call(bowtieString, shell=True)\n",
821 | "\n",
822 | " #read unaligned file for negs, and then don't flip boolean of alignmentTable\n",
823 | " with open(unalignedFile) as infile:\n",
824 | " sgsAligning = set()\n",
825 | " for i, line in enumerate(infile):\n",
826 | " if i%4 == 0: #id line\n",
827 | " sgsAligning.add(line.strip()[1:])\n",
828 | "\n",
829 | " alignmentColumns.append(negTable.apply(lambda row: row.name in sgsAligning, axis=1))\n",
830 | " \n",
831 | "alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3])\n",
832 | "alignmentTable.head()"
833 | ]
834 | },
835 | {
836 | "cell_type": "code",
837 | "execution_count": null,
838 | "metadata": {
839 | "collapsed": true
840 | },
841 | "outputs": [],
842 | "source": [
843 | "acceptedNegList = []\n",
844 | "negCount = 0\n",
845 | "for i, (name, row) in enumerate(pd.concat((negTable,alignmentTable),axis=1, keys=['seq','alignment']).iterrows()):\n",
846 | " oligo = upstreamConstant + row['seq','sequence'] + downstreamConstant\n",
847 | " if row['alignment'].all() and matchREsites(oligo, restrictionSites):\n",
848 | " acceptedNegList.append(('non-targeting_%05d' % negCount, 'negative_control', 'na', row['seq','sequence'], oligo, 0))\n",
849 | " negCount += 1\n",
850 | " \n",
851 | "acceptedNegs = pd.DataFrame(acceptedNegList, columns = ['sgId', 'gene', 'transcript', 'protospacer sequence', 'oligo sequence', 'off-target stringency']).set_index('sgId')"
852 | ]
853 | },
854 | {
855 | "cell_type": "markdown",
856 | "metadata": {},
857 | "source": [
858 | "## Finalizing library design"
859 | ]
860 | },
861 | {
862 | "cell_type": "markdown",
863 | "metadata": {},
864 | "source": [
865 | "* divide genes into sublibrary groups (if required)\n",
866 | "* assign negative control sgRNAs to sublibrary groups; ~1-2% of the number of sgRNAs in the library is a good rule-of-thumb\n",
867 | "* append PCR adapter sequences (~18bp) to each end of the oligo sequences to enable amplification of the oligo pool; each sublibary should have an orthogonal sequence so they can be cloned separately"
868 | ]
869 | },
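{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of these finalization steps. The sublibrary grouping, negative control fraction, and PCR adapter sequences are hypothetical placeholders for illustration only, not the published v2 designs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#hypothetical sublibrary grouping and orthogonal ~18bp PCR adapter pairs (placeholders)\n",
"sublibraries = {'sublibrary_1': ['GENE1', 'GENE2']}\n",
"pcrAdapters = {'sublibrary_1': ('GTAACGGGTCAGAGCACC', 'GTCGAGAGCAGTCCTTCG')}\n",
"\n",
"finalOligos = []\n",
"negOffset = 0\n",
"for sublibName, genes in sublibraries.iteritems():\n",
"    sublibTable = libraryTable_complete.loc[libraryTable_complete['gene'].isin(genes)]\n",
"\n",
"    #assign ~1.5% negative controls, drawing a distinct slice of acceptedNegs for each sublibrary\n",
"    numNegs = int(round(len(sublibTable) * .015))\n",
"    sublibTable = pd.concat((sublibTable, acceptedNegs.iloc[negOffset:negOffset + numNegs]))\n",
"    negOffset += numNegs\n",
"\n",
"    #append the sublibrary-specific adapters to each end of the cloning oligo\n",
"    fwdAdapter, revAdapter = pcrAdapters[sublibName]\n",
"    for sgId, row in sublibTable.iterrows():\n",
"        finalOligos.append((sgId, sublibName, fwdAdapter + row['oligo sequence'] + revAdapter))\n",
"\n",
"finalOligoTable = pd.DataFrame(finalOligos, columns=['sgId', 'sublibrary', 'full oligo sequence']).set_index('sgId')"
]
},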
870 | {
871 | "cell_type": "code",
872 | "execution_count": null,
873 | "metadata": {
874 | "collapsed": true
875 | },
876 | "outputs": [],
877 | "source": []
878 | }
879 | ],
880 | "metadata": {
881 | "kernelspec": {
882 | "display_name": "Python 2",
883 | "language": "python",
884 | "name": "python2"
885 | },
886 | "language_info": {
887 | "codemirror_mode": {
888 | "name": "ipython",
889 | "version": 2
890 | },
891 | "file_extension": ".py",
892 | "mimetype": "text/x-python",
893 | "name": "python",
894 | "nbconvert_exporter": "python",
895 | "pygments_lexer": "ipython2",
896 | "version": "2.7.3"
897 | }
898 | },
899 | "nbformat": 4,
900 | "nbformat_minor": 0
901 | }
902 |
--------------------------------------------------------------------------------
/sgRNA_learning.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import subprocess
4 | import tempfile
5 | import multiprocessing
6 | import numpy as np
7 | import scipy as sp
8 | import pandas as pd
9 | from ConfigParser import SafeConfigParser
10 | from Bio import Seq, SeqIO
11 | import pysam
12 | from bx.bbi.bigwig_file import BigWigFile
13 | from sklearn import linear_model, svm, ensemble, preprocessing, grid_search, metrics
14 |
15 | from expt_config_parser import parseExptConfig, parseLibraryConfig
16 |
17 | ###############################################################################
18 | # Import and Merge Training/Test Data #
19 | ###############################################################################
20 | def loadExperimentData(experimentFile, supportedLibraryPath, library, basePath = '.'):
21 | libDict, librariesToTables = parseLibraryConfig(os.path.join(supportedLibraryPath, 'library_config.txt'))
22 |
23 | geneTableDict = dict()
24 | phenotypeTableDict = dict()
25 | libraryTableDict = dict()
26 |
27 | parser = SafeConfigParser()
28 | parser.read(experimentFile)
29 | for exptConfigFile in parser.sections():
30 | configDict = parseExptConfig(exptConfigFile,libDict)[0]
31 |
32 | libraryTable = pd.read_csv(os.path.join(basePath,configDict['output_folder'],configDict['experiment_name']) + '_librarytable.txt',
33 | sep='\t', index_col=range(1), header=0)
34 | libraryTableDict[configDict['experiment_name']] = libraryTable
35 |
36 | geneTable = pd.read_csv(os.path.join(basePath,configDict['output_folder'],configDict['experiment_name']) + '_genetable.txt',
37 | sep='\t',index_col=range(2),header=range(3))
38 | phenotypeTable = pd.read_csv(os.path.join(basePath,configDict['output_folder'],configDict['experiment_name']) + '_phenotypetable.txt',\
39 | sep='\t',index_col=range(1),header=range(2))
40 |
41 | condTups = [(condStr.split(':')[0],condStr.split(':')[1]) for condStr in parser.get(exptConfigFile, 'condition_tuples').strip().split('\n')]
42 | # print condTups
43 |
44 | geneTableDict[configDict['experiment_name']] = geneTable.loc[:,[level_name for level_name in geneTable.columns if (level_name[0],level_name[1]) in condTups]]
45 | phenotypeTableDict[configDict['experiment_name']] = phenotypeTable.loc[:,[level_name for level_name in phenotypeTable.columns if (level_name[0],level_name[1]) in condTups]]
46 |
47 | mergedLibraryTable = pd.concat(libraryTableDict.values())
48 | # print mergedLibraryTable.head()
49 | mergedLibraryTable_dedup = mergedLibraryTable.drop_duplicates(['gene','sequence'])
50 | # print mergedLibraryTable_dedup.head()
51 | mergedGeneTable = pd.concat(geneTableDict.values(), keys=geneTableDict.keys(), axis = 1)
52 | # print mergedGeneTable.head()
53 | mergedPhenotypeTable = pd.concat(phenotypeTableDict.values(), keys=phenotypeTableDict.keys(), axis = 1)
54 | # print mergedPhenotypeTable.head()
55 | mergedPhenotypeTable_dedup = mergedPhenotypeTable.loc[mergedLibraryTable_dedup.index]
56 |
57 | return mergedLibraryTable_dedup, mergedPhenotypeTable_dedup, mergedGeneTable
58 |
59 | def calculateDiscriminantScores(geneTable, effectSize = 'average phenotype of strongest 3', pValue = 'Mann-Whitney p-value'):
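    #scale each gene's effect size by the standard deviation of the negative-control
    #('pseudo') gene distribution and multiply by -log10(p-value), yielding one
    #discriminant score per gene per screen condition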
60 | isPseudo = getPseudoIndices(geneTable)
61 | geneTable_reordered = geneTable.reorder_levels((3,0,1,2), axis=1)
62 | zscores = geneTable_reordered[effectSize] / geneTable_reordered.loc[isPseudo,effectSize].std()
63 | pvals = -1 * np.log10(geneTable_reordered[pValue])
64 |
65 | seriesDict = dict()
66 | for group, table in pd.concat((zscores, pvals), keys=(effectSize,pValue),axis=1).reorder_levels((1,2,3,0), axis=1).groupby(level=range(3),axis=1):
67 | # print table.head()
68 | seriesDict[group] = table[group].apply(lambda row: row[effectSize] * row[pValue], axis=1)
69 |
70 | return pd.DataFrame(seriesDict)
71 |
72 | def getNormalizedsgRNAsOverThresh(libraryTable, phenotypeTable, discriminantTable, threshold, numToNormalize, transcripts=True):
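    #for each gene (or gene/transcript pair) whose best discriminant score passes the
    #threshold, take the sgRNA phenotypes from its most active screen condition and
    #divide by the mean of the numToNormalize strongest phenotypes, putting sgRNA
    #activities for different genes on a comparable scale for training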
73 | maxDiscriminants = pd.concat([discriminantTable.abs().idxmax(axis=1), discriminantTable.abs().max(axis=1)], keys = ('best col','best score'), axis=1)
74 |
75 | if transcripts:
76 | grouper = (libraryTable['gene'],libraryTable['transcripts'])
77 | else:
78 | grouper = libraryTable['gene']
79 |
80 | normedPhenotypes = []
81 | for name, group in phenotypeTable.groupby(grouper):
82 | if (transcripts and name[0] == 'negative_control') or (not transcripts and name == 'negative_control'):
83 | continue
84 | maxDisc = maxDiscriminants.loc[name]
85 |
86 | if not transcripts:
87 | maxDisc = maxDisc.sort('best score').iloc[-1]
88 |
89 | if maxDisc['best score'] >= threshold:
90 | bestGroup = group[maxDisc['best col']]
91 | normedPhenotypes.append(bestGroup / np.mean(sorted(bestGroup.dropna(), key=abs, reverse=True)[:numToNormalize]))
92 |
93 | return pd.concat(normedPhenotypes), maxDiscriminants
94 |
95 | def getGeneFolds(libraryTable, kfold, transcripts=True):
96 | if transcripts:
97 | geneGroups = pd.Series(range(len(libraryTable)), index=libraryTable.index).groupby((libraryTable['gene'],libraryTable['transcripts']))
98 | else:
99 | geneGroups = pd.Series(range(len(libraryTable)), index=libraryTable.index).groupby(libraryTable['gene'])
100 |
101 | idxList = np.arange(geneGroups.ngroups)
102 | np.random.shuffle(idxList)
103 |
104 | foldsize = int(np.floor(geneGroups.ngroups * 1.0 / kfold))
105 | folds = []
106 | for i in range(kfold):
107 | testGroups = []
108 | trainGroups = []
109 | testSet = set(idxList[i * foldsize: (i+1) * foldsize])
110 |         for j, (name, group) in enumerate(geneGroups):
111 |             if j in testSet:
112 | testGroups.extend(group.values)
113 | else:
114 | trainGroups.extend(group.values)
115 | folds.append((trainGroups,testGroups))
116 |
117 | return folds
118 |
119 |
120 | ###############################################################################
121 | # Calculate sgRNA Parameters #
122 | ###############################################################################
123 | #TSS annotations rely on the library input TSSs; may want to convert to GENCODE in the future
124 | def generateTssTable(geneTable, libraryTssFile, cagePeakFile, cageWindow, aliasDict = {'NFIK':'MKI67IP'}):
125 | codingTssList = []
126 | with open(libraryTssFile) as infile:
127 | for line in infile:
128 | linesplit = line.strip().split('\t')
129 | try:
130 | chrom = int(linesplit[2][3:])
131 | except ValueError:
132 | chrom = linesplit[2][3:]
133 | codingTssList.append((chrom, int(linesplit[3]), linesplit[0], linesplit[1], linesplit[2], linesplit[3], linesplit[4]))
134 |
135 | codingTupDict = {(tup[2],tup[3]):tup for tup in codingTssList}
136 |
137 | codingGeneToTransList = dict()
138 |
139 | for geneTrans in codingTupDict:
140 | if geneTrans[0] not in codingGeneToTransList:
141 | codingGeneToTransList[geneTrans[0]] = []
142 |
143 | codingGeneToTransList[geneTrans[0]].append(geneTrans[1])
144 |
145 | positionList = []
146 | for (gene,transcriptList), row in geneTable.iterrows():
147 |
148 | if gene not in codingGeneToTransList: #only pseudogenes
149 | positionList.append((np.nan,np.nan,np.nan))
150 | continue
151 |
152 | if transcriptList == 'all':
153 | positions = [codingTupDict[(gene, trans)][1] for trans in codingGeneToTransList[gene]]
154 | else:
155 | positions = [codingTupDict[(gene, trans)][1] for trans in transcriptList.split(',')]
156 |
157 |         positionList.append((np.mean(positions), codingTupDict[(gene,trans)][6], codingTupDict[(gene,trans)][4])) #strand/chromosome taken from the last transcript iterated above (shared across a gene's transcripts)
158 |
159 | tssPositionTable = pd.DataFrame(positionList, index=geneTable.index, columns=['position', 'strand','chromosome'])
160 |
161 | cagePeaks = pysam.Tabixfile(cagePeakFile)
162 | halfwindow = cageWindow
163 | strictColor = '60,179,113'
164 | relaxedColor = '30,144,255'
165 |
166 | cagePeakRanges = []
167 | for i, (gt, tssRow) in enumerate(tssPositionTable.dropna().iterrows()):
168 | peaks = cagePeaks.fetch(tssRow['chromosome'],tssRow['position'] - halfwindow,tssRow['position'] + halfwindow, parser=pysam.asBed())
169 |
170 | ranges = []
171 | relaxedRanges = []
172 | for peak in peaks:
173 | # print peak
174 | if peak.strand == tssRow['strand'] and peak.itemRGB == strictColor:
175 | ranges.append((peak.start, peak.end))
176 | elif peak.strand == tssRow['strand'] and peak.itemRGB == relaxedColor:
177 | relaxedRanges.append((peak.start, peak.end))
178 |
179 | if len(ranges) > 0:
180 | cagePeakRanges.append(ranges)
181 | else:
182 | cagePeakRanges.append(relaxedRanges)
183 |
184 | cageSeries = pd.Series(cagePeakRanges, index = tssPositionTable.dropna().index)
185 |
186 | tssPositionTable_cage = pd.concat([tssPositionTable, cageSeries], axis=1)
187 | tssPositionTable_cage.columns = ['position', 'strand','chromosome','cage peak ranges']
188 | return tssPositionTable_cage
189 |
190 | # for (gene, transList), row in geneTable.iterrows():
191 | # if gene not in gencodeData and gene in aliasDict:
192 | # geneData = gencodeData[aliasDict[gene]]
193 | # else:
194 | # geneData = gencodeData[gene]
195 |
196 | def generateTssTable_P1P2strategy(tssTable, cagePeakFile, matchedp1p2Window, anyp1p2Window, anyPeakWindow, minDistanceForTwoTSS, aliasDict):
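    #TSS assignment proceeds in tiers: (1) CAGE peaks whose names match p1/p2 for this
    #gene or one of its aliases within matchedp1p2Window; (2) any p1/p2-named peaks
    #within anyp1p2Window; (3) robust, then permissive, CAGE peaks within anyPeakWindow;
    #(4) fall back to the annotated TSS position. P1 and P2 are reported as a single
    #'P1P2' TSS when they lie within minDistanceForTwoTSS of each other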
197 | cagePeaks = pysam.Tabixfile(cagePeakFile)
198 | strictColor = '60,179,113'
199 | relaxedColor = '30,144,255'
200 |
201 | resultRows = []
202 | for gene, tssRowGroup in tssTable.groupby(level=0):
203 |
204 | if len(set(tssRowGroup['chromosome'].values)) == 1:
205 | chrom = tssRowGroup['chromosome'].values[0]
206 | else:
207 |             raise ValueError('multiple annotated chromosomes for ' + gene)
208 |
209 | if len(set(tssRowGroup['strand'].values)) == 1:
210 | strand = tssRowGroup['strand'].values[0]
211 | else:
212 |             raise ValueError('multiple annotated strands for ' + gene)
213 |
214 | #try to match P1/P2 names within the window
215 | # peaks = cagePeaks.fetch(chrom,max(0,tssRowGroup['position'].min() - matchedp1p2Window),tssRowGroup['position'].max() + matchedp1p2Window, parser=pysam.asBed())
216 | peaks = []
217 | for transcript, row in tssRowGroup.iterrows():
218 | peaks.extend([p for p in cagePeaks.fetch(chrom,max(0,row['position'] - matchedp1p2Window),row['position'] + matchedp1p2Window, parser=pysam.asBed())])
219 | p1Matches = set()
220 | p2Matches = set()
221 | for peak in peaks:
222 | if peak.strand == strand and matchPeakName(peak.name, aliasDict[gene] if gene in aliasDict else [gene], 'p1'):
223 | p1Matches.add((peak.start,peak.end))
224 | elif peak.strand == strand and matchPeakName(peak.name, aliasDict[gene] if gene in aliasDict else [gene], 'p2') and peak.itemRGB == strictColor:
225 | p2Matches.add((peak.start,peak.end))
226 | p1Matches = list(p1Matches)
227 | p2Matches = list(p2Matches)
228 |
229 | if len(p1Matches) >= 1:
230 | if len(p1Matches) > 1:
231 | print 'multiple matched p1:', gene, p1Matches, p2Matches #rare event, typically a doubly-named TSS, basically at the same spot
232 |
233 | closestMatch = p1Matches[0]
234 | for match in p1Matches:
235 | if min(abs(match[0] - tssRowGroup['position'])) < min(abs(closestMatch[0] - tssRowGroup['position'])):
236 | closestMatch = match
237 | p1Matches = [closestMatch]
238 |
239 | if len(p2Matches) > 1:
240 | print 'multiple matched p2:', gene, p1Matches, p2Matches
241 |
242 | closestMatch = p2Matches[0]
243 | for match in p2Matches:
244 | if min(abs(match[0] - tssRowGroup['position'])) < min(abs(closestMatch[0] - tssRowGroup['position'])):
245 | closestMatch = match
246 | p2Matches = [closestMatch]
247 |
248 | if len(p2Matches) == 0 or abs(p1Matches[0][0] - p2Matches[0][0]) <= minDistanceForTwoTSS:
249 | resultRows.append((gene,'P1P2', chrom, strand, 'CAGE, matched peaks', p1Matches[0], p2Matches[0] if len(p2Matches) > 0 else p1Matches[0]))
250 | else:
251 | resultRows.append((gene,'P1', chrom, strand, 'CAGE, matched peaks', p1Matches[0], p1Matches[0]))
252 | resultRows.append((gene,'P2', chrom, strand, 'CAGE, matched peaks', p2Matches[0], p2Matches[0]))
253 |
254 |
255 | #try to match any P1/P2 names
256 | else:
257 | peaks = []
258 | for transcript, row in tssRowGroup.iterrows():
259 | peaks.extend([p for p in cagePeaks.fetch(chrom,max(0,row['position'] - anyp1p2Window),row['position'] + anyp1p2Window, parser=pysam.asBed())])
260 | p1Matches = set()
261 | p2Matches = set()
262 | for peak in peaks:
263 | if peak.strand == strand and peak.name.find('p1@') != -1:
264 | p1Matches.add((peak.start,peak.end))
265 | elif peak.strand == strand and peak.name.find('p2@') != -1 and peak.itemRGB == strictColor:
266 | p2Matches.add((peak.start,peak.end))
267 | p1Matches = list(p1Matches)
268 | p2Matches = list(p2Matches)
269 |
270 | if len(p1Matches) >=1:
271 | if len(p1Matches) > 1:
272 | print 'multiple nearby p1:', gene, p1Matches, p2Matches
273 |
274 | closestMatch = p1Matches[0]
275 | for match in p1Matches:
276 | if min(abs(match[0] - tssRowGroup['position'])) < min(abs(closestMatch[0] - tssRowGroup['position'])):
277 | closestMatch = match
278 | p1Matches = [closestMatch]
279 |
280 | if len(p2Matches) > 1:
281 | print 'multiple nearby p2:', gene, p1Matches, p2Matches
282 |
283 | closestMatch = p2Matches[0]
284 | for match in p2Matches:
285 | if min(abs(match[0] - tssRowGroup['position'])) < min(abs(closestMatch[0] - tssRowGroup['position'])):
286 | closestMatch = match
287 | p2Matches = [closestMatch]
288 |
289 | if len(p2Matches) == 0 or abs(p1Matches[0][0] - p2Matches[0][0]) <= minDistanceForTwoTSS:
290 | resultRows.append((gene,'P1P2', chrom, strand, 'CAGE, primary peaks', p1Matches[0], p2Matches[0] if len(p2Matches) > 0 else p1Matches[0]))
291 | else:
292 | resultRows.append((gene,'P1', chrom, strand, 'CAGE, primary peaks', p1Matches[0], p1Matches[0]))
293 | resultRows.append((gene,'P2', chrom, strand, 'CAGE, primary peaks', p2Matches[0], p2Matches[0]))
294 |
295 |
296 | #try to match robust or permissive peaks
297 | else:
298 | for transcript, row in tssRowGroup.iterrows():
299 | peaks = cagePeaks.fetch(chrom,max(0,row['position']) - anyPeakWindow,row['position'] + anyPeakWindow, parser=pysam.asBed())
300 | robustPeaks = []
301 | permissivePeaks = []
302 | for peak in peaks:
303 | if peak.strand == strand and peak.itemRGB == strictColor:
304 | robustPeaks.append((peak.start,peak.end))
305 | if peak.strand == strand and peak.itemRGB == relaxedColor:
306 | permissivePeaks.append((peak.start,peak.end))
307 |
308 | if len(robustPeaks) >= 1:
309 | if strand == '+':
310 | resultRows.append((gene,transcript[1], chrom, strand, 'CAGE, robust peak', robustPeaks[0], robustPeaks[-1]))
311 | else:
312 | resultRows.append((gene,transcript[1], chrom, strand, 'CAGE, robust peak', robustPeaks[-1], robustPeaks[0]))
313 | elif len(permissivePeaks) >= 1:
314 | if strand == '+':
315 | resultRows.append((gene,transcript[1], chrom, strand, 'CAGE permissive peak', permissivePeaks[0], permissivePeaks[-1]))
316 | else:
317 | resultRows.append((gene,transcript[1], chrom, strand, 'CAGE permissive peak', permissivePeaks[-1], permissivePeaks[0]))
318 | else:
319 | resultRows.append((gene, transcript[1], chrom, strand, 'Annotation', (row['position'],row['position']), (row['position'],row['position'])))
320 |
321 | return pd.DataFrame(resultRows, columns=['gene','transcript','chromosome','strand','TSS source','primary TSS','secondary TSS']).set_index(keys=['gene','transcript'])
322 |
323 | def generateSgrnaDistanceTable_p1p2Strategy(sgInfoTable, libraryTable, p1p2Table, transcripts=False):
324 | sgDistanceSeries = []
325 |
326 | if transcripts == False: # when sgRNAs weren't designed based on the p1p2 strategy
327 | for name, group in sgInfoTable['pam coordinate'].groupby(libraryTable['gene']):
328 | if name in p1p2Table.index:
329 | tssRow = p1p2Table.loc[name]
330 |
331 | if len(tssRow) == 1:
332 | tssRow = tssRow.iloc[0]
333 | for sgId, pamCoord in group.iteritems():
334 | if tssRow['strand'] == '+':
335 | sgDistanceSeries.append((sgId, name, tssRow.name,
336 | pamCoord - tssRow['primary TSS'][0],
337 | pamCoord - tssRow['primary TSS'][1],
338 | pamCoord - tssRow['secondary TSS'][0],
339 | pamCoord - tssRow['secondary TSS'][1]))
340 | else:
341 | sgDistanceSeries.append((sgId, name, tssRow.name,
342 | (pamCoord - tssRow['primary TSS'][1]) * -1,
343 | (pamCoord - tssRow['primary TSS'][0]) * -1,
344 | (pamCoord - tssRow['secondary TSS'][1]) * -1,
345 | (pamCoord - tssRow['secondary TSS'][0]) * -1))
346 |
347 | else:
348 | for sgId, pamCoord in group.iteritems():
349 | closestTssRow = tssRow.loc[tssRow.apply(lambda row: abs(pamCoord - row['primary TSS'][0]), axis=1).idxmin()]
350 |
351 | if closestTssRow['strand'] == '+':
352 | sgDistanceSeries.append((sgId, name, closestTssRow.name,
353 | pamCoord - closestTssRow['primary TSS'][0],
354 | pamCoord - closestTssRow['primary TSS'][1],
355 | pamCoord - closestTssRow['secondary TSS'][0],
356 | pamCoord - closestTssRow['secondary TSS'][1]))
357 | else:
358 | sgDistanceSeries.append((sgId, name, closestTssRow.name,
359 | (pamCoord - closestTssRow['primary TSS'][1]) * -1,
360 | (pamCoord - closestTssRow['primary TSS'][0]) * -1,
361 | (pamCoord - closestTssRow['secondary TSS'][1]) * -1,
362 | (pamCoord - closestTssRow['secondary TSS'][0]) * -1))
363 | else:
364 | for name, group in sgInfoTable['pam coordinate'].groupby([libraryTable['gene'],libraryTable['transcripts']]):
365 | if name in p1p2Table.index:
366 | tssRow = p1p2Table.loc[[name]]
367 |
368 | if len(tssRow) == 1:
369 | tssRow = tssRow.iloc[0]
370 | for sgId, pamCoord in group.iteritems():
371 | if tssRow['strand'] == '+':
372 | sgDistanceSeries.append((sgId, tssRow.name[0], tssRow.name[1],
373 | pamCoord - tssRow['primary TSS'][0],
374 | pamCoord - tssRow['primary TSS'][1],
375 | pamCoord - tssRow['secondary TSS'][0],
376 | pamCoord - tssRow['secondary TSS'][1]))
377 | else:
378 | sgDistanceSeries.append((sgId, tssRow.name[0], tssRow.name[1],
379 | (pamCoord - tssRow['primary TSS'][1]) * -1,
380 | (pamCoord - tssRow['primary TSS'][0]) * -1,
381 | (pamCoord - tssRow['secondary TSS'][1]) * -1,
382 | (pamCoord - tssRow['secondary TSS'][0]) * -1))
383 |
384 | else:
385 | print name, tssRow
386 | raise ValueError('all gene/trans pairs should be unique')
387 |
388 | return pd.DataFrame(sgDistanceSeries, columns=['sgId', 'gene', 'transcript', 'primary TSS-Up', 'primary TSS-Down', 'secondary TSS-Up', 'secondary TSS-Down']).set_index(keys=['sgId'])
389 |
390 | def generateSgrnaDistanceTable(sgInfoTable, tssTable, libraryTable):
391 | sgDistanceSeries = []
392 |
393 | for name, group in sgInfoTable['pam coordinate'].groupby([libraryTable['gene'],libraryTable['transcripts']]):
394 | if name in tssTable.index:
395 | tssRow = tssTable.loc[name]
396 | if len(tssRow['cage peak ranges']) != 0:
397 | spotList = []
398 | for rangeTup in tssRow['cage peak ranges']:
399 | spotList.append((rangeTup[0] - tssRow['position']) * (-1 if tssRow['strand'] == '-' else 1))
400 | spotList.append((rangeTup[1] - tssRow['position']) * (-1 if tssRow['strand'] == '-' else 1))
401 |
402 | sgDistanceSeries.append(group.apply(lambda row: distanceMetrics(row, tssRow['position'], min(spotList),max(spotList),tssRow['strand'])))
403 |
404 | else:
405 | sgDistanceSeries.append(group.apply(lambda row: distanceMetrics(row, tssRow['position'], 0, 0, tssRow['strand'])))
406 |
407 | return pd.concat(sgDistanceSeries)
408 |
409 | def distanceMetrics(position, annotatedTss, cageUp, cageDown, strand):
410 | relativePos = (position - annotatedTss) * (1 if strand == '+' else -1)
411 |
412 | return pd.Series((relativePos, relativePos-cageUp, relativePos-cageDown), index=('annotated','cageUp','cageDown'))
413 |
414 | def generateSgrnaLengthSeries(libraryTable):
415 | lengthSeries = libraryTable.apply(lambda row: len(row['sequence']),axis=1)
416 | lengthSeries.name = 'length'
417 | return lengthSeries
418 |
419 | def generateRelativeBasesAndStrand(sgInfoTable, tssTable, libraryTable, genomeDict):
420 | relbases = []
421 | strands = []
422 | sgIds = []
423 | for gene, sgInfoGroup in sgInfoTable.groupby(libraryTable['gene']):
424 | tssRowGroup = tssTable.loc[gene]
425 |
426 | if len(set(tssRowGroup['chromosome'].values)) == 1:
427 | chrom = tssRowGroup['chromosome'].values[0]
428 | else:
429 |             raise ValueError('multiple annotated chromosomes for ' + gene)
430 |
431 | if len(set(tssRowGroup['strand'].values)) == 1:
432 | strand = tssRowGroup['strand'].values[0]
433 | else:
434 |             raise ValueError('multiple annotated strands for ' + gene)
435 |
436 | for sg, sgInfo in sgInfoGroup.iterrows():
437 | sgIds.append(sg)
438 | geneTup = (sgInfo['gene_name'],','.join(sgInfo['transcript_list']))
439 | strands.append(True if sgInfo['strand'] == strand else False)
440 |
441 | baseMatrix = []
442 | for pos in np.arange(-30,10):
443 | baseMatrix.append(getBaseRelativeToPam(chrom, sgInfo['pam coordinate'],sgInfo['length'], sgInfo['strand'], pos, genomeDict))
444 | relbases.append(baseMatrix)
445 |
446 | relbases = pd.DataFrame(relbases, index = sgIds, columns = np.arange(-30,10)).loc[libraryTable.index]
447 | strands = pd.DataFrame(strands, index = sgIds, columns = ['same strand']).loc[libraryTable.index]
448 |
449 | return relbases, strands
450 |
451 | def generateBooleanBaseTable(baseTable):
452 | relbases_bool = []
453 | for base in ['A','G','C','T']:
454 | relbases_bool.append(baseTable.applymap(lambda val: val == base))
455 |
456 | return pd.concat(relbases_bool, keys=['A','G','C','T'], axis=1)
457 |
458 | def generateBooleanDoubleBaseTable(baseTable):
459 | doubleBaseTable = []
460 | tableCols = []
461 | for b1 in ['A','G','C','T']:
462 | for b2 in ['A','G','C','T']:
463 | for i in np.arange(-30,8):
464 | doubleBaseTable.append(pd.concat((baseTable[i] == b1, baseTable[i+1] == b2),axis=1).all(axis=1))
465 | tableCols.append(((b1,b2),i))
466 | return pd.concat(doubleBaseTable, keys=tableCols, axis=1)
467 |
468 | def getBaseRelativeToPam(chrom, pamPos, length, strand, relPos, genomeDict):
469 | rc = {'A':'T','T':'A','G':'C','C':'G','N':'N'}
470 | #print chrom, pamPos, relPos
471 | if strand == '+':
472 | return rc[genomeDict[chrom][pamPos - relPos].upper()]
473 | elif strand == '-':
474 | return genomeDict[chrom][pamPos + relPos].upper()
475 | else:
476 | raise ValueError()
477 |
478 | def getMaxLengthHomopolymer(sequence, base):
479 | sequence = sequence.upper()
480 | base = base.upper()
481 |
482 | maxBaseCount = 0
483 | curBaseCount = 0
484 | for b in sequence:
485 | if b == base:
486 | curBaseCount += 1
487 | else:
488 | maxBaseCount = max((curBaseCount, maxBaseCount))
489 | curBaseCount = 0
490 |
491 | return max((curBaseCount, maxBaseCount))
492 |
493 | def getFractionBaseList(sequence, baseList):
494 | baseSet = [base.upper() for base in baseList]
495 | counter = 0.0
496 | for b in sequence.upper():
497 | if b in baseSet:
498 | counter += 1.0
499 |
500 | return counter / len(sequence)
501 |
502 | #need to fix file naming
503 | def getRNAfoldingTable(libraryTable):
504 | tempfile_fa = tempfile.NamedTemporaryFile('w+t', delete=False)
505 | tempfile_rnafold = tempfile.NamedTemporaryFile('w+t', delete=False)
506 |
507 | for name, row in libraryTable.iterrows():
508 | tempfile_fa.write('>' + name + '\n' + row['sequence'] + '\n')
509 |
510 | tempfile_fa.close()
511 | tempfile_rnafold.close()
512 | # print tempfile_fa.name, tempfile_rnafold.name
513 |
514 | subprocess.call('RNAfold --noPS < %s > %s' % (tempfile_fa.name, tempfile_rnafold.name), shell=True)
515 |
516 | mfeSeries_noScaffold = parseViennaMFE(tempfile_rnafold.name, libraryTable)
517 | isPaired = parseViennaPairing(tempfile_rnafold.name, libraryTable)
518 |
519 | tempfile_fa = tempfile.NamedTemporaryFile('w+t', delete=False)
520 | tempfile_rnafold = tempfile.NamedTemporaryFile('w+t', delete=False)
521 |
522 | with open(tempfile_fa.name,'w') as outfile:
523 | for name, row in libraryTable.iterrows():
524 | outfile.write('>' + name + '\n' + row['sequence'] + 'GTTTAAGAGCTAAGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT\n')
525 |
526 | tempfile_fa.close()
527 | tempfile_rnafold.close()
528 | # print tempfile_fa.name, tempfile_rnafold.name
529 |
530 | subprocess.call('RNAfold --noPS < %s > %s' % (tempfile_fa.name, tempfile_rnafold.name), shell=True)
531 |
532 | mfeSeries_wScaffold = parseViennaMFE(tempfile_rnafold.name, libraryTable)
533 |
534 | return pd.concat((mfeSeries_noScaffold, mfeSeries_wScaffold, isPaired), keys=('no scaffold', 'with scaffold', 'is Paired'), axis=1)
535 |
536 | def parseViennaMFE(viennaOutputFile, libraryTable):
537 | mfeList = []
538 | with open(viennaOutputFile) as infile:
539 | for i, line in enumerate(infile):
540 | if i%3 == 2:
541 | mfeList.append(float(line.strip().strip('.() ')))
542 | return pd.Series(mfeList, index=libraryTable.index, name='RNA minimum free energy')
543 |
544 | def parseViennaPairing(viennaOutputFile, libraryTable):
545 | paired = []
546 | with open(viennaOutputFile) as infile:
547 | for i, line in enumerate(infile):
548 | if i%3 == 2:
549 | foldString = line.strip().split(' ')[0]
550 | paired.append([char != '.' for char in foldString[-18:]])
551 | return pd.DataFrame(paired, index=libraryTable.index, columns = range(-20,-2))
552 |
553 | def getChromatinDataSeries(bigwigFile, libraryTable, sgInfoTable, tssTable, colname = '', naValue = 0):
554 | bwindex = BigWigFile(open(bigwigFile))
555 | chromDict = tssTable['chromosome'].to_dict()
556 |
557 | chromatinScores = []
558 | for name, sgInfo in sgInfoTable.iterrows():
559 | geneTup = (sgInfo['gene_name'],','.join(sgInfo['transcript_list']))
560 |
561 | if geneTup not in chromDict: #negative controls
562 | chromatinScores.append(np.nan)
563 | continue
564 |
565 | if sgInfo['strand'] == '+':
566 | sgRange = sgInfo['pam coordinate'] + sgInfo['length']
567 | else:
568 | sgRange = sgInfo['pam coordinate'] - sgInfo['length']
569 |
570 | chrom = chromDict[geneTup]
571 |
572 | chromatinArray = bwindex.get_as_array(chrom, min(sgInfo['pam coordinate'], sgRange), max(sgInfo['pam coordinate'], sgRange))
573 | if chromatinArray is not None and len(chromatinArray) > 0:
574 | chromatinScores.append(np.nanmean(chromatinArray))
575 | else: #often chrY when using K562 data..
576 | # print name
577 | # print chrom, min(sgInfo['pam coordinate'], sgRange), max(sgInfo['pam coordinate'], sgRange)
578 | chromatinScores.append(np.nan)
579 |
580 | chromatinSeries = pd.Series(chromatinScores, index=libraryTable.index, name = colname)
581 |
582 | return chromatinSeries.fillna(naValue)
583 |
584 | def getChromatinDataSeriesByGene(bigwigFileHandle, libraryTable, sgInfoTable, p1p2Table, sgrnaDistanceTable_p1p2, colname = '', naValue = 0, normWindow = 1000):
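    #like getChromatinDataSeries, but the mean signal over each sgRNA footprint is
    #normalized to the maximum signal within +/- normWindow bp of the gene's primary
    #TSS, making accessibility values comparable across loci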
585 | bwindex = bigwigFileHandle #BigWigFile(open(bigwigFile))
586 |
587 | chromatinScores = []
588 | for (gene, transcript), sgInfoGroup in sgInfoTable.groupby([sgrnaDistanceTable_p1p2['gene'], sgrnaDistanceTable_p1p2['transcript']]):
589 | tssRow = p1p2Table.loc[[(gene, transcript)]].iloc[0,:]
590 |
591 | chrom = tssRow['chromosome']
592 |
593 | normWindowArray = bwindex.get_as_array(chrom, max(0, tssRow['primary TSS'][0] - normWindow), tssRow['primary TSS'][0] + normWindow)
594 | if normWindowArray is not None:
595 | normFactor = np.nanmax(normWindowArray)
596 | else:
597 | normFactor = 1
598 |
599 | windowMin = max(0, min(sgInfoGroup['pam coordinate']) - max(sgInfoGroup['length']) - 10)
600 | windowMax = max(sgInfoGroup['pam coordinate']) + max(sgInfoGroup['length']) + 10
601 | chromatinWindow = bwindex.get_as_array(chrom, windowMin, windowMax)
602 |
603 | chromatinScores.append(sgInfoGroup.apply(lambda row: getChromatinData(row, chromatinWindow, windowMin, normFactor), axis=1))
604 |
605 |
606 | chromatinSeries = pd.concat(chromatinScores)
607 |
608 | return chromatinSeries.fillna(naValue)
609 |
610 | def getChromatinData(sgInfoRow, chromatinWindowArray, windowMin, normFactor):
611 | if sgInfoRow['strand'] == '+':
612 | sgRange = sgInfoRow['pam coordinate'] + sgInfoRow['length']
613 | else:
614 | sgRange = sgInfoRow['pam coordinate'] - sgInfoRow['length']
615 |
616 |
617 | if chromatinWindowArray is not None:# and len(chromatinWindowArray) > 0:
618 | chromatinArray = chromatinWindowArray[min(sgInfoRow['pam coordinate'], sgRange) - windowMin: max(sgInfoRow['pam coordinate'], sgRange) - windowMin]
619 | return np.nanmean(chromatinArray)/normFactor
620 | else: #often chrY when using K562 data..
621 | # print name
622 | # print chrom, min(sgInfo['pam coordinate'], sgRange), max(sgInfo['pam coordinate'], sgRange)
623 | return np.nan
624 |
625 | def generateTypicalParamTable(libraryTable, sgInfoTable, tssTable, p1p2Table, genomeDict, bwFileHandleDict, transcripts=False):
626 | lengthSeries = generateSgrnaLengthSeries(libraryTable)
627 |
628 | # sgrnaPositionTable = generateSgrnaDistanceTable(sgInfoTable, tssTable, libraryTable)
629 | sgrnaPositionTable_p1p2 = generateSgrnaDistanceTable_p1p2Strategy(sgInfoTable, libraryTable, p1p2Table, transcripts)
630 |
631 | baseTable, strand = generateRelativeBasesAndStrand(sgInfoTable, tssTable, libraryTable, genomeDict)
632 | booleanBaseTable = generateBooleanBaseTable(baseTable)
633 | doubleBaseTable = generateBooleanDoubleBaseTable(baseTable)
634 |
635 | printNow('.')
636 | baseList = ['A','G','C','T']
637 | homopolymerTable = pd.concat([libraryTable.apply(lambda row: np.floor(getMaxLengthHomopolymer(row['sequence'], base)), axis=1) for base in baseList],keys=baseList,axis=1)
638 |
639 | baseFractions = pd.concat([libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['A']),axis=1),
640 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['G']),axis=1),
641 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['C']),axis=1),
642 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['T']),axis=1),
643 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['G','C']),axis=1),
644 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['G','A']),axis=1),
645 | libraryTable.apply(lambda row: getFractionBaseList(row['sequence'], ['C','A']),axis=1)],keys=['A','G','C','T','GC','purine','CA'],axis=1)
646 |
647 | printNow('.')
648 |
649 | dnaseSeries = getChromatinDataSeriesByGene(bwFileHandleDict['dnase'], libraryTable, sgInfoTable, p1p2Table, sgrnaPositionTable_p1p2)
650 | printNow('.')
651 | faireSeries = getChromatinDataSeriesByGene(bwFileHandleDict['faire'], libraryTable, sgInfoTable, p1p2Table, sgrnaPositionTable_p1p2)
652 | printNow('.')
653 | mnaseSeries = getChromatinDataSeriesByGene(bwFileHandleDict['mnase'], libraryTable, sgInfoTable, p1p2Table, sgrnaPositionTable_p1p2)
654 | printNow('.')
655 |
656 | rnafolding = getRNAfoldingTable(libraryTable)
657 |
658 | printNow('Done!')
659 |
660 | return pd.concat([lengthSeries,
661 | sgrnaPositionTable_p1p2.iloc[:,2:],
662 | homopolymerTable,
663 | baseFractions,
664 | strand,
665 | booleanBaseTable['A'],
666 | booleanBaseTable['T'],
667 | booleanBaseTable['G'],
668 | booleanBaseTable['C'],
669 | doubleBaseTable,
670 | pd.concat([dnaseSeries,faireSeries,mnaseSeries],keys=['DNase','FAIRE','MNase'], axis=1),
671 | rnafolding['no scaffold'],
672 | rnafolding['with scaffold'],
673 | rnafolding['is Paired']],keys=['length',
674 | 'distance',
675 | 'homopolymers',
676 | 'base fractions',
677 | 'strand',
678 | 'base table-A',
679 | 'base table-T',
680 | 'base table-G',
681 | 'base table-C',
682 | 'base dimers',
683 | 'accessibility',
684 | 'RNA folding-no scaffold',
685 | 'RNA folding-with scaffold',
686 | 'RNA folding-pairing, no scaffold'],axis=1)
687 |
688 | # def generateTypicalParamTable_parallel(libraryTable, sgInfoTable, tssTable, p1p2Table, genomeDict, bwFileHandleDict, processors):
689 | # processPool = multiprocessing.Pool(processors)
690 |
691 | # colTupList = zip([group for gene, group in libraryTable.groupby(libraryTable['gene'])],
692 | # [group for gene, group in sgInfoTable.groupby(libraryTable['gene'])])
693 |
694 | # result = processPool.map(lambda colTup: generateTypicalParamTable(colTup[0], colTup[1], tssTable, p1p2Table, genomeDict,bwFileHandleDict), colTupList)
695 |
696 | # return pd.concat(result)
697 |
698 | ###############################################################################
699 | # Learn Parameter Weights #
700 | ###############################################################################
701 | def fitParams(paramTable, scoreTable, fitTable):
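    #fitTable is expected to hold one row per parameter column, with a 'type' field
    #('binary', 'continuous', 'binnable', or 'binnable_onehot') and a 'params' field;
    #binary parameters pass through unchanged, continuous parameters are fit by an SVR
    #grid search against the activity scores, and binnable parameters are summarized by
    #a per-bin statistic; the fitted estimators are returned for reuse by transformParams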
702 | predictedParams = []
703 | estimators = []
704 |
705 | for i, (name, col) in enumerate(paramTable.iteritems()):
706 |
707 | fitRow = fitTable.iloc[i]
708 |
709 | if fitRow['type'] == 'binary': #binary parameter
710 | # print name, 'is binary parameter'
711 | predictedParams.append(col)
712 | estimators.append('binary')
713 |
714 | elif fitRow['type'] == 'continuous':
715 | col_reshape = col.values.reshape(len(col),1)
716 | parameters = fitRow['params']
717 |
718 | svr = svm.SVR(cache_size=500)
719 | clf = grid_search.GridSearchCV(svr, parameters, n_jobs=16, verbose=0)
720 | clf.fit(col_reshape, scoreTable)
721 |
722 | print name, clf.best_params_
723 | predictedParams.append(pd.Series(clf.predict(col_reshape), index=col.index, name=name))
724 | estimators.append(clf.best_estimator_)
725 |
726 | elif fitRow['type'] == 'binnable':
727 | parameters = fitRow['params']
728 |
729 | assignedBins = binValues(col, parameters['bin width'], parameters['min edge data'])
730 | groupStats = scoreTable.groupby(assignedBins).agg(parameters['bin function'])
731 |
732 | # print name
733 | # print pd.concat((groupStats,scoreTable.groupby(assignedBins).size()), axis=1)
734 |
735 | binnedScores = assignedBins.apply(lambda binVal: groupStats.loc[binVal])
736 |
737 | predictedParams.append(binnedScores)
738 | estimators.append(groupStats)
739 |
740 | elif fitRow['type'] == 'binnable_onehot':
741 | parameters = fitRow['params']
742 |
743 | assignedBins = binValues(col, parameters['bin width'], parameters['min edge data'])
744 | binGroups = scoreTable.groupby(assignedBins)
745 | groupStats = binGroups.agg(parameters['bin function'])
746 |
747 | # print name
748 | # print pd.concat((groupStats,scoreTable.groupby(assignedBins).size()), axis=1)
749 |
750 | oneHotFrame = pd.DataFrame(np.zeros((len(assignedBins),len(binGroups))), index = assignedBins.index, \
751 | columns=pd.MultiIndex.from_tuples([(name[0],', '.join([name[1],key])) for key in sorted(binGroups.groups.keys())]))
752 |
753 | for groupName, group in binGroups:
754 | oneHotFrame.loc[group.index, (name[0],', '.join([name[1],groupName]))] = 1
755 |
756 | predictedParams.append(oneHotFrame)
757 | estimators.append(groupStats)
758 |
759 | else:
760 |         raise ValueError(fitRow['type'] + ' not implemented')
761 |
762 | return pd.concat(predictedParams, axis=1), estimators
763 |
764 | def binValues(col, binsize, minEdgePoints=0, edgeOffset = None):
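    #bin a continuous column into fixed-width bins; when minEdgePoints > 0, sparse bins
    #at either extreme are merged into open-ended '< x' / '>= x' edge bins (or shifted
    #numerically by edgeOffset instead, which eases plotting)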
765 | bins = np.floor(col / binsize) * binsize
766 |
767 | if minEdgePoints <= 0:
768 |         if edgeOffset is None:
769 | return bins.apply(lambda binVal: str(binVal))
770 | else:
771 | return bins
772 | elif minEdgePoints >= len(col):
773 | raise ValueError('too few data points to meet minimum edge requirements')
774 | else:
775 | binGroups = bins.groupby(bins)
776 | binCounts = binGroups.agg(len).sort_index()
777 |
778 | i = 0
779 | leftBin = []
780 | if binCounts.iloc[i] < minEdgePoints:
781 | leftCount = 0
782 | while leftCount < minEdgePoints:
783 | leftCount += binCounts.iloc[i]
784 | leftBin.append(binCounts.index[i])
785 | i += 1
786 |
787 | leftLessThan = binCounts.index[i]
788 |
789 | j = -1
790 | rightBin = []
791 | if binCounts.iloc[j] < minEdgePoints:
792 | rightCount = 0
793 | while rightCount < minEdgePoints:
794 | rightBin.append(binCounts.index[j])
795 | rightCount += binCounts.iloc[j]
796 | j -= 1
797 |
798 | rightMoreThan = binCounts.index[j + 1]
799 |
800 | if i > len(binCounts) + j:
801 | raise ValueError('min edge requirements cannot be met')
802 |
803 |         if edgeOffset is None: #return strings for bins, fine for grouping, problems for plotting
804 | return bins.apply(lambda binVal: '< %f' % leftLessThan if binVal in leftBin else('>= %f' % rightMoreThan if binVal in rightBin else str(binVal)))
805 | else: #apply arbitrary offset instead to ease plotting
806 | return bins.apply(lambda binVal: leftLessThan - edgeOffset if binVal in leftBin else(rightMoreThan + edgeOffset if binVal in rightBin else binVal))
807 |
808 | def transformParams(paramTable, fitTable, estimators):
809 | transformedParams = []
810 |
811 | for i, (name, col) in enumerate(paramTable.iteritems()):
812 | fitRow = fitTable.iloc[i]
813 |
814 | if fitRow['type'] == 'binary':
815 | transformedParams.append(col)
816 | elif fitRow['type'] == 'continuous':
817 | col_reshape = col.values.reshape(len(col),1)
818 | transformedParams.append(pd.Series(estimators[i].predict(col_reshape), index=col.index, name=name))
819 | elif fitRow['type'] == 'binnable':
820 | binStats = estimators[i]
821 | assignedBins = applyBins(col, binStats.index.values)
822 | transformedParams.append(assignedBins.apply(lambda binVal: binStats.loc[binVal]))
823 |
824 | elif fitRow['type'] == 'binnable_onehot':
825 | binStats = estimators[i]
826 |
827 | assignedBins = applyBins(col, binStats.index.values)
828 | binGroups = col.groupby(assignedBins)
829 |
830 | # print name
831 | # print pd.concat((groupStats,scoreTable.groupby(assignedBins).size()), axis=1)
832 |
833 | oneHotFrame = pd.DataFrame(np.zeros((len(assignedBins),len(binGroups))), index = assignedBins.index, \
834 | columns=pd.MultiIndex.from_tuples([(name[0],', '.join([name[1],key])) for key in sorted(binGroups.groups.keys())]))
835 |
836 | for groupName, group in binGroups:
837 | oneHotFrame.loc[group.index, (name[0],', '.join([name[1],groupName]))] = 1
838 |
839 | transformedParams.append(oneHotFrame)
840 |
841 | return pd.concat(transformedParams, axis=1)
842 |
843 | def applyBins(column, binStrings):
844 | leftLabel = ''
845 | rightLabel = ''
846 | binTups = []
847 | for binVal in binStrings:
848 | if binVal[0] == '<':
849 | leftLabel = binVal
850 | elif binVal[0] == '>':
851 | rightLabel = binVal
852 | rightBound = float(binVal[3:])
853 | else:
854 | binTups.append((float(binVal),binVal))
855 |
856 | binTups.sort()
857 | # print binTups
858 | leftBound = binTups[0][0]
859 | if leftLabel == '':
860 | leftLabel = binTups[0][1]
861 |
862 | if rightLabel == '':
863 | rightLabel = binTups[-1][1]
864 | rightBound = binTups[-1][0]
865 |
866 | def binFunc(val):
867 | return leftLabel if val < leftBound else (rightLabel if val >= rightBound else [tup[1] for tup in binTups if val >= tup[0]][-1])
868 |
869 | return column.apply(binFunc)
870 |
871 | ###############################################################################
872 | #                 Predict sgRNA Scores and Construct Library                 #
873 | ###############################################################################
874 | def findAllGuides(p1p2Table, genomeDict, rangeTup, sgRNALength=20):
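    #scan the window given by rangeTup around each primary/secondary TSS for PAM sites
    #on both genomic strands (detected as 'CC' or 'GG' dinucleotides on the top strand)
    #and record each candidate protospacer of sgRNALength nt with its 5' base forced to G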
875 | newLibraryTable = []
876 | newSgInfoTable = []
877 |
878 | for tssTup, tssRow in p1p2Table.iterrows():
879 | rangeStart = min(min(tssRow['primary TSS']), min(tssRow['secondary TSS'])) + (rangeTup[0] if tssRow['strand'] == '+' else -1 * rangeTup[1])
880 | rangeEnd = max(max(tssRow['primary TSS']), max(tssRow['secondary TSS'])) + (rangeTup[1] if tssRow['strand'] == '+' else -1 * rangeTup[0])
881 |
882 | genomeRange = str(genomeDict[tssRow['chromosome']][rangeStart:rangeEnd + 1].seq)
883 |
884 | rangeLength = rangeEnd + 1 - rangeStart
885 | for posOffset in range(rangeLength):
886 | if genomeRange[posOffset:posOffset+1+1].upper() == 'CC' \
887 | and posOffset + 3 + sgRNALength < rangeLength \
888 | and 'N' not in genomeRange[posOffset:posOffset+3+sgRNALength].upper():
889 | pamCoord = rangeStart+posOffset
890 | sgId = tssTup[0] + '_' + '+' + '_' + str(pamCoord) + '.' + str(sgRNALength + 3) + '-' + tssTup[1]
891 | gene = tssTup[0]
892 | transcripts = tssTup[1]
893 | rawSequence = genomeRange[posOffset:posOffset+3+sgRNALength]
894 | sequence = 'G' + str(Seq.Seq(rawSequence[3:-1]).reverse_complement())
895 |
896 | newLibraryTable.append((sgId, gene, transcripts, sequence, rawSequence))
897 | newSgInfoTable.append((sgId, 'None', gene, sgRNALength + 3, pamCoord, 'not assigned', pamCoord, '+', tssTup[1].split(',')))
898 |
899 | elif genomeRange[posOffset-1:posOffset+1].upper() == 'GG' \
900 | and posOffset - 3 - sgRNALength >= 0 \
901 | and 'N' not in genomeRange[posOffset + 1 - 3 - sgRNALength:posOffset+1].upper():
902 | pamCoord = rangeStart+posOffset
903 | sgId = tssTup[0] + '_' + '-' + '_' + str(pamCoord) + '.' + str(sgRNALength + 3) + '-' + tssTup[1]
904 | gene = tssTup[0]
905 | transcripts = tssTup[1]
906 | rawSequence = genomeRange[posOffset + 1 - 3 - sgRNALength:posOffset+1]
907 | sequence = 'G' + rawSequence[1:-3]
908 |
909 | newLibraryTable.append((sgId, gene, transcripts, sequence, rawSequence))
910 | newSgInfoTable.append((sgId, 'None', gene, sgRNALength + 3, pamCoord, 'not assigned', pamCoord, '-', tssTup[1].split(',')))
911 |
912 | return pd.DataFrame(newLibraryTable,columns=['sgId','gene','transcripts','sequence', 'genomic sequence']).set_index('sgId'), \
913 | pd.DataFrame(newSgInfoTable, columns=['sgId','Sublibrary','gene_name', 'length', 'pam coordinate','pass_score','position','strand', 'transcript_list']).set_index('sgId')
914 |
915 | ###############################################################################
916 | # Utility Functions #
917 | ###############################################################################
918 | def getPseudoIndices(table):
919 | return table.apply(lambda row: row.name[0][:6] == 'pseudo', axis=1)
920 |
921 | def loadGencodeData(gencodeGTF, indexByENSG = True):
922 | printNow('Loading annotation file...')
923 | gencodeData = dict()
924 | with open(gencodeGTF) as gencodeFile:
925 | for line in gencodeFile:
926 | if line[0] != '#':
927 | linesplit = line.strip().split('\t')
928 | attrsplit = linesplit[-1].strip('; ').split('; ')
929 | attrdict = {attr.split(' ')[0]:attr.split(' ')[1].strip('\"') for attr in attrsplit if attr[:3] !='tag'}
930 | attrdict['tags'] = [attr.split(' ')[1].strip('\"') for attr in attrsplit if attr[:3] == 'tag']
931 |
932 | if indexByENSG:
933 | dictKey = attrdict['gene_id'].split('.')[0]
934 | else:
935 | dictKey = attrdict['gene_name']
936 |
937 | #catch y-linked pseudoautosomal genes
938 | if 'PAR' in attrdict['tags'] and linesplit[0] == 'chrY':
939 | continue
940 |
941 | if linesplit[2] == 'gene':# and attrdict['gene_type'] == 'protein_coding':
942 | gencodeData[dictKey] = ([linesplit[0],long(linesplit[3]),long(linesplit[4]),linesplit[6], attrdict],[])
943 | elif linesplit[2] == 'transcript':
944 | gencodeData[dictKey][1].append([linesplit[0],long(linesplit[3]),long(linesplit[4]),linesplit[6], attrdict])
945 |
946 | printNow('Done\n')
947 |
948 | return gencodeData
949 |
950 | def loadGenomeAsDict(genomeFasta):
951 | printNow('Loading genome file...')
952 | genomeDict = SeqIO.to_dict(SeqIO.parse(genomeFasta,'fasta'))
953 | printNow('Done\n')
954 | return genomeDict
955 |
956 | def loadCageBedData(cageBedFile, matchList = ['p1','p2']):
957 | cageBedDict = {match:dict() for match in matchList}
958 |
959 | with open(cageBedFile) as infile:
960 | for line in infile:
961 | linesplit = line.strip().split('\t')
962 |
963 | for name in linesplit[3].split(','):
964 | namesplit = name.split('@')
965 | if len(namesplit) == 2:
966 | for match in matchList:
967 | if namesplit[0] == match:
968 | cageBedDict[match][namesplit[1]] = linesplit
969 |
970 | return cageBedDict
971 |
972 | def matchPeakName(peakName, geneAliasList, promoterRank):
973 | for peakString in peakName.split(','):
974 | peakSplit = peakString.split('@')
975 |
976 | if len(peakSplit) == 2\
977 | and peakSplit[0] == promoterRank\
978 | and peakSplit[1] in geneAliasList:
979 | return True
980 |
981 |         if len(peakSplit) > 2: #flag unexpected peak name formats for manual review
982 |             print 'unexpected peak name format:', peakName
983 |
984 | return False
985 |
986 | def generateAliasDict(hgncFile, gencodeData):
987 | hgncTable = pd.read_csv(hgncFile,sep='\t', header=0).fillna('')
988 |
989 | geneToAliases = dict()
990 | geneToENSG = dict()
991 |
992 | for i, row in hgncTable.iterrows():
993 | geneToAliases[row['Approved Symbol']] = [row['Approved Symbol']]
994 | geneToAliases[row['Approved Symbol']].extend([] if len(row['Previous Symbols']) == 0 else [name.strip() for name in row['Previous Symbols'].split(',')])
995 | geneToAliases[row['Approved Symbol']].extend([] if len(row['Synonyms']) == 0 else [name.strip() for name in row['Synonyms'].split(',')])
996 |
997 | geneToENSG[row['Approved Symbol']] = row['Ensembl Gene ID']
998 |
999 | # for gene in gencodeData:
1000 | # if gene not in geneToAliases:
1001 | # geneToAliases[gene] = [gene]
1002 |
1003 | # geneToAliases[gene].extend([tr[-1]['transcript_id'].split('.')[0] for tr in gencodeData[gene][1]])
1004 |
1005 | return geneToAliases, geneToENSG
1006 |
1007 | #Parse information from the sgRNA ID standard format
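#e.g. parseSgId('Drug_Targets+Kinase_Phosphatase=AARS_+_70323216.24-all~e39m1') returns
# {'Sublibrary': 'Drug_Targets+Kinase_Phosphatase', 'gene_name': 'AARS', 'strand': '+',
#  'position': 70323216, 'length': 24, 'pass_score': 'e39m1', 'transcript_list': ['all']}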
1008 | def parseSgId(sgId):
1009 | parseDict = dict()
1010 |
1011 | #sublibrary
1012 | if len(sgId.split('=')) == 2:
1013 | parseDict['Sublibrary'] = sgId.split('=')[0]
1014 | remainingId = sgId.split('=')[1]
1015 | else:
1016 | parseDict['Sublibrary'] = None
1017 | remainingId = sgId
1018 |
1019 | #gene name and strand
1020 | underscoreSplit = remainingId.split('_')
1021 |
1022 | for i,item in enumerate(underscoreSplit):
1023 | if item == '+':
1024 | strand = '+'
1025 | geneName = '_'.join(underscoreSplit[:i])
1026 | remainingId = '_'.join(underscoreSplit[i+1:])
1027 | break
1028 | elif item == '-':
1029 | strand = '-'
1030 | geneName = '_'.join(underscoreSplit[:i])
1031 | remainingId = '_'.join(underscoreSplit[i+1:])
1032 | break
1033 | else:
1034 | continue
1035 |
1036 | parseDict['strand'] = strand
1037 | parseDict['gene_name'] = geneName
1038 |
1039 | #position
1040 | dotSplit = remainingId.split('.')
1041 | parseDict['position'] = int(dotSplit[0])
1042 | remainingId = '.'.join(dotSplit[1:])
1043 |
1044 | #length incl pam
1045 | dashSplit = remainingId.split('-')
1046 | parseDict['length'] = int(dashSplit[0])
1047 | remainingId = '-'.join(dashSplit[1:])
1048 |
1049 | #pass score
1050 | tildaSplit = remainingId.split('~')
1051 | parseDict['pass_score'] = tildaSplit[-1]
1052 | remainingId = '~'.join(tildaSplit[:-1]) #should always be length 1 anyway
1053 |
1054 | #transcripts
1055 | parseDict['transcript_list'] = remainingId.split(',')
1056 |
1057 | return parseDict
1058 |
1059 | def parseAllSgIds(libraryTable):
1060 | sgInfoList = []
1061 | for sgId, row in libraryTable.iterrows():
1062 | sgInfo = parseSgId(sgId)
1063 |
1064 |         #the parsed position is the PAM coordinate on both strands
1065 |         sgInfo['pam coordinate'] = sgInfo['position']
1069 |
1070 | sgInfoList.append(sgInfo)
1071 |
1072 | return pd.DataFrame(sgInfoList, index=libraryTable.index)
1073 |
1074 | def printNow(outputString):
1075 | sys.stdout.write(outputString)
1076 | sys.stdout.flush()
1077 |
--------------------------------------------------------------------------------
/Library_design_walkthrough.md:
--------------------------------------------------------------------------------
1 |
2 | 1. Learning sgRNA predictors from empirical data
3 | * Load scripts and empirical data
4 |     * Calculate parameters for empirical sgRNAs
5 |     * Fit parameters
6 | 2. Applying machine learning model to predict sgRNA activity
7 |     * Generate TSS annotation using FANTOM dataset
8 |     * Find all sgRNAs in genomic regions of interest
9 | * Predicting sgRNA activity
10 | 3. Construct sgRNA libraries
11 | * Score sgRNAs for off-target potential
12 | * Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering
13 | * Design negative controls matching the base composition of the library
14 | * Finalizing library design
15 |
16 | # 1. Learning sgRNA predictors from empirical data
17 | ## Load scripts and empirical data
18 |
19 |
20 | ```python
21 | import sys
22 | sys.path.insert(0, '../ScreenProcessing/')
23 | %run sgRNA_learning.py
24 | ```
25 |
26 |
27 | ```python
28 | genomeDict = loadGenomeAsDict('large_data_files/hg19.fa')
29 | ```
30 |
31 | Loading genome file...Done
32 |
33 |
34 |
35 | ```python
36 | #to use pre-calculated sgRNA activity score data (e.g. provided CRISPRi training data), load the following:
37 | #CRISPRa activity score data also included in data_files
38 | libraryTable_training = pd.read_csv('data_files/CRISPRi_trainingdata_libraryTable.txt', sep='\t', index_col = 0)
39 | libraryTable_training.head()
40 | ```
41 |
42 |
43 |
44 |
45 |
46 |
47 |
48 |
49 | | sgId | sublibrary | gene | transcripts | sequence |
50 | |---|---|---|---|---|
51 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323216.24-all~e39m1 | drug_targets+kinase_phosphatase | AARS | all | GCCCCAGGATCAGGCCCCGCG |
52 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323296.24-all~e39m1 | drug_targets+kinase_phosphatase | AARS | all | GGCCGCCCTCGGAGAGCTCTG |
53 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323318.24-all~e39m1 | drug_targets+kinase_phosphatase | AARS | all | GACGGCGACCCTAGGAGAGGT |
54 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323362.24-all~e39m1 | drug_targets+kinase_phosphatase | AARS | all | GGTGCAGCGGGCCCTTGGCGG |
55 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323441.24-all~e39m1 | drug_targets+kinase_phosphatase | AARS | all | GCGCTCTGATTGGACGGAGCG |
98 |
99 |
100 |
101 |
102 |
103 |
104 |
105 |
106 | ```python
107 | sgInfoTable_training = pd.read_csv('data_files/CRISPRi_trainingdata_sgRNAInfoTable.txt', sep='\t', index_col=0)
108 | sgInfoTable_training.head()
109 | ```
110 |
111 |
112 |
113 |
114 |
115 |
116 |
117 |
118 | | sgId | Sublibrary | gene_name | length | pam coordinate | pass_score | position | strand | transcript_list |
119 | |---|---|---|---|---|---|---|---|---|
120 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323216.24-all~e39m1 | Drug_Targets+Kinase_Phosphatase | AARS | 24 | 70323216 | e39m1 | 70323216 | + | ['all'] |
121 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323296.24-all~e39m1 | Drug_Targets+Kinase_Phosphatase | AARS | 24 | 70323296 | e39m1 | 70323296 | + | ['all'] |
122 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323318.24-all~e39m1 | Drug_Targets+Kinase_Phosphatase | AARS | 24 | 70323318 | e39m1 | 70323318 | + | ['all'] |
123 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323362.24-all~e39m1 | Drug_Targets+Kinase_Phosphatase | AARS | 24 | 70323362 | e39m1 | 70323362 | + | ['all'] |
124 | | Drug_Targets+Kinase_Phosphatase=AARS_+_70323441.24-all~e39m1 | Drug_Targets+Kinase_Phosphatase | AARS | 24 | 70323441 | e39m1 | 70323441 | + | ['all'] |
195 |
196 |
197 |
198 |
199 |
200 |
201 |
202 |
203 | ```python
204 | activityScores = pd.read_csv('data_files/CRISPRi_trainingdata_activityScores.txt',sep='\t',index_col=0, header=None).iloc[:,0]
205 | activityScores.head()
206 | ```
207 |
208 |
209 |
210 |
211 | 0
212 | Drug_Targets+Kinase_Phosphatase=AARS_+_70323216.24-all~e39m1 0.348892
213 | Drug_Targets+Kinase_Phosphatase=AARS_+_70323296.24-all~e39m1 0.912409
214 | Drug_Targets+Kinase_Phosphatase=AARS_+_70323318.24-all~e39m1 0.997242
215 | Drug_Targets+Kinase_Phosphatase=AARS_+_70323362.24-all~e39m1 0.962154
216 | Drug_Targets+Kinase_Phosphatase=AARS_+_70323441.24-all~e39m1 0.019320
217 | Name: 1, dtype: float64
218 |
219 |
220 |
221 |
222 | ```python
223 | tssTable = pd.read_csv('data_files/human_tssTable.txt',sep='\t', index_col=range(2))
224 | tssTable.head()
225 | ```
226 |
227 |
228 |
229 |
230 |
231 |
232 |
233 |
234 | | gene | transcripts | position | strand | chromosome | cage peak ranges |
235 | |---|---|---|---|---|---|
236 | | A1BG | all | 58864864 | - | chr19 | [(58864822, 58864847), (58864848, 58864868)] |
237 | | A1CF | ENST00000373993.1 | 52619744 | - | chr10 | [] |
238 | | A1CF | ENST00000374001.2,ENST00000373997.3,ENST00000282641.2 | 52645434 | - | chr10 | [(52645379, 52645393), (52645416, 52645444)] |
239 | | A2M | all | 9268752 | - | chr12 | [(9268547, 9268556), (9268559, 9268568), (9268... |
240 | | A2ML1 | all | 8975067 | + | chr12 | [(8975061, 8975072), (8975101, 8975108), (8975... |
289 |
290 |
291 |
292 |
293 |
294 |
295 |
296 |
297 | ```python
298 | p1p2Table = pd.read_csv('data_files/human_p1p2Table.txt',sep='\t', header=0, index_col=range(2))
299 | p1p2Table['primary TSS'] = p1p2Table['primary TSS'].apply(lambda tupString: (int(tupString.strip('()').split(', ')[0].split('.')[0]), int(tupString.strip('()').split(', ')[1].split('.')[0])))
300 | p1p2Table['secondary TSS'] = p1p2Table['secondary TSS'].apply(lambda tupString: (int(tupString.strip('()').split(', ')[0].split('.')[0]),int(tupString.strip('()').split(', ')[1].split('.')[0])))
301 | p1p2Table.head()
302 | ```
303 |
304 |
305 |
306 |
307 |
308 |
309 |
310 |
311 | | gene | transcript | chromosome | strand | TSS source | primary TSS | secondary TSS |
312 | |---|---|---|---|---|---|---|
313 | | A1BG | P1 | chr19 | - | CAGE, matched peaks | (58858938, 58859039) | (58858938, 58859039) |
314 | | A1BG | P2 | chr19 | - | CAGE, matched peaks | (58864822, 58864847) | (58864822, 58864847) |
315 | | A1CF | P1P2 | chr10 | - | CAGE, matched peaks | (52645379, 52645393) | (52645379, 52645393) |
316 | | A2M | P1P2 | chr12 | - | CAGE, matched peaks | (9268507, 9268523) | (9268528, 9268542) |
317 | | A2ML1 | P1P2 | chr12 | + | CAGE, matched peaks | (8975206, 8975223) | (8975144, 8975169) |
373 |
374 |
375 |
376 |
377 |
378 |
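The lambdas in the loading cell above convert the stored tuple strings back into integer pairs. A minimal standalone sketch of the same conversion, assuming the ranges were saved in a form like '(58864822.0, 58864847.0)':

```python
#parse a "(start, end)" tuple string back into a tuple of ints,
#tolerating a trailing decimal as in '(58864822.0, 58864847.0)'
def parseTupleString(tupString):
    left, right = tupString.strip('()').split(', ')
    return (int(left.split('.')[0]), int(right.split('.')[0]))

print parseTupleString('(58864822.0, 58864847.0)') #(58864822, 58864847)
```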
379 |
380 | ## Calculate parameters for empirical sgRNAs
381 |
382 | ### Because scikit-learn currently does not support any robust method for saving and re-loading the machine learning model, the best strategy is to simply re-learn the model from the training data
383 |
384 |
385 | ```python
386 | #Load bigwig files for any chromatin data of interest
387 | bwhandleDict = {'dnase':BigWigFile(open('large_data_files/wgEncodeOpenChromDnaseK562BaseOverlapSignalV2.bigWig')),
388 | 'faire':BigWigFile(open('large_data_files/wgEncodeOpenChromFaireK562Sig.bigWig')),
389 | 'mnase':BigWigFile(open('large_data_files/wgEncodeSydhNsomeK562Sig.bigWig'))}
390 | ```
391 |
392 |
393 | ```python
394 | paramTable_trainingGuides = generateTypicalParamTable(libraryTable_training,sgInfoTable_training, tssTable, p1p2Table, genomeDict, bwhandleDict)
395 | ```
396 |
397 | ....
398 |
399 | /usr/local/lib/python2.7/dist-packages/numpy/lib/nanfunctions.py:675: RuntimeWarning: Mean of empty slice
400 | warnings.warn("Mean of empty slice", RuntimeWarning)
401 | /usr/local/lib/python2.7/dist-packages/numpy/lib/nanfunctions.py:326: RuntimeWarning: All-NaN slice encountered
402 | warnings.warn("All-NaN slice encountered", RuntimeWarning)
403 |
404 |
405 | .Done!
406 |
407 | ## Fit parameters
408 |
409 |
410 | ```python
411 | #load in the 5-fold cross-validation splits used to generate the model
412 | import cPickle
413 | with open('data_files/CRISPRi_trainingdata_traintestsets.txt') as infile:
414 | geneFold_train, geneFold_test, fitTable = cPickle.load(infile)
415 | ```
416 |
417 |
418 | ```python
419 | transformedParams_train, estimators = fitParams(paramTable_trainingGuides.loc[activityScores.dropna().index].iloc[geneFold_train], activityScores.loc[activityScores.dropna().index].iloc[geneFold_train], fitTable)
420 |
421 | transformedParams_test = transformParams(paramTable_trainingGuides.loc[activityScores.dropna().index].iloc[geneFold_test], fitTable, estimators)
422 |
423 | reg = linear_model.ElasticNetCV(l1_ratio=[.5, .75, .9, .99,1], n_jobs=16, max_iter=2000)
424 |
425 | scaler = preprocessing.StandardScaler()
426 | reg.fit(scaler.fit_transform(transformedParams_train), activityScores.loc[activityScores.dropna().index].iloc[geneFold_train])
427 | predictedScores = pd.Series(reg.predict(scaler.transform(transformedParams_test)), index=transformedParams_test.index)
428 | testScores = activityScores.loc[activityScores.dropna().index].iloc[geneFold_test]
429 |
430 | print 'Prediction AUC-ROC:', metrics.roc_auc_score((testScores >= .75).values, np.array(predictedScores.values,dtype='float64'))
431 | print 'Prediction R^2:', reg.score(scaler.transform(transformedParams_test), testScores)
432 | print 'Regression parameters:', reg.l1_ratio_, reg.alpha_
433 | coefs = pd.DataFrame(zip(*[abs(reg.coef_),reg.coef_]), index = transformedParams_test.columns, columns=['abs','true'])
434 | print 'Number of features used:', len(coefs) - sum(coefs['abs'] < .00000000001)
435 | ```
436 |
437 | ('distance', 'primary TSS-Up') {'C': 0.05, 'gamma': 0.0001}
438 | ('distance', 'primary TSS-Down') {'C': 0.5, 'gamma': 5e-05}
439 | ('distance', 'secondary TSS-Up') {'C': 0.1, 'gamma': 5e-05}
440 | ('distance', 'secondary TSS-Down') {'C': 0.1, 'gamma': 5e-05}
441 |
442 |
443 | /home/mhorlbeck/.local/lib/python2.7/site-packages/sklearn/utils/validation.py:323: UserWarning: StandardScaler assumes floating point values as input, got object
444 | "got %s" % (estimator, X.dtype))
445 | /home/mhorlbeck/.local/lib/python2.7/site-packages/sklearn/linear_model/base.py:400: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
446 | if precompute == 'auto':
447 |
448 |
449 | Prediction AUC-ROC: 0.803109696478
450 | Prediction R^2: 0.31263687609
451 | Regression parameters: 0.5 0.00534455043278
452 | Number of features used: 327
453 |
454 |
455 |
456 | ```python
457 | #can save state for reproducing estimators later
458 | #the pickling of the scikit-learn estimators/regressors will allow the model to be reloaded for prediction of other guide designs,
459 | # but will not be compatible across scikit-learn versions, so it is important to preserve the training data and training/test folds
460 | import cPickle
461 | estimatorString = cPickle.dumps((fitTable, estimators, scaler, reg))
462 | with open(PICKLE_FILE,'w') as outfile:
463 | outfile.write(estimatorString)
464 |
465 | #also save the transformed parameters as these can slightly differ based on the automated binning strategy
466 | transformedParams_train.head().to_csv(TRANSFORMED_PARAM_HEADER,sep='\t')
467 | ```
468 |
469 | # 2. Applying machine learning model to predict sgRNA activity
470 |
471 | ## Generate TSS annotation using FANTOM dataset
472 |
473 |
474 | ```python
475 | #you can supply any table of gene transcription start sites formatted as below
476 | #for demonstration purposes, the rest of this walkthrough will use a small arbitrary subset of the protein coding TSS table
477 | tssTable_new = tssTable.iloc[10:20, :-1]
478 | tssTable_new.head()
479 | ```
480 |
481 |
482 |
483 |
484 |
485 |
486 |
487 |
488 | | gene | transcripts | position | strand | chromosome |
489 | |---|---|---|---|---|
490 | | AADACL2 | all | 151451714 | + | chr3 |
491 | | AADAT | ENST00000337664.4 | 171011117 | - | chr4 |
492 | | AADAT | ENST00000337664.4,ENST00000509167.1,ENST00000353187.2 | 171011284 | - | chr4 |
493 | | AADAT | ENST00000509167.1,ENST00000515480.1,ENST00000353187.2 | 171011424 | - | chr4 |
494 | | AAED1 | all | 99417562 | - | chr9 |
535 |
536 |
537 |
538 |
539 |
540 |
541 |
542 |
543 | ```python
544 | #if desired, use the ensembl annotation and the HGNC database to supply gene aliases to assist P1P2 matching in the next step
545 | gencodeData = loadGencodeData('large_data_files/gencode.v19.annotation.gtf')
546 | geneToAliases = generateAliasDict('large_data_files/20150424_HGNC_symbols.txt',gencodeData)
547 | ```
548 |
549 | Loading annotation file...Done
550 |
551 |
552 |
553 | ```python
554 | #Now create a TSS annotation by searching for P1 and P2 peaks near annotated TSSs
555 | #same parameters as for our lncRNA libraries
556 | p1p2Table_new = generateTssTable_P1P2strategy(tssTable_new, 'large_data_files/TSS_human.sorted.bed.gz',
557 | matchedp1p2Window = 30000, #region around supplied TSS annotation to search for a FANTOM P1 or P2 peak that matches the gene name (or alias)
558 | anyp1p2Window = 500, #region around supplied TSS annotation to search for the nearest P1 or P2 peak
559 | anyPeakWindow = 200, #region around supplied TSS annotation to search for any CAGE peak
560 | minDistanceForTwoTSS = 1000, #If a P1 and P2 peak are found, maximum distance at which to combine into a single annotation (with primary/secondary TSS positions)
561 | aliasDict = geneToAliases[0])
562 | #the function will report some gene ID collisions due to aliases and redundancy in the genome, but will resolve these itself
563 | ```
564 |
565 |
566 | ```python
567 | p1p2Table_new.head()
568 | ```
569 |
570 |
571 |
572 |
573 |
574 |
575 |
576 |
577 | | gene | transcript | chromosome | strand | TSS source | primary TSS | secondary TSS |
578 | |---|---|---|---|---|---|---|
579 | | AADACL2 | P1P2 | chr3 | + | CAGE, matched peaks | (151451707, 151451722) | (151451707, 151451722) |
580 | | AADAT | P1P2 | chr4 | - | CAGE, matched peaks | (171011323, 171011408) | (171011084, 171011147) |
581 | | AAED1 | P1P2 | chr9 | - | CAGE, matched peaks | (99417562, 99417609) | (99417615, 99417622) |
582 | | AAGAB | P1P2 | chr15 | - | CAGE, matched peaks | (67546963, 67547024) | (67546963, 67547024) |
583 | | AAK1 | P1P2 | chr2 | - | CAGE, matched peaks | (69870747, 69870812) | (69870854, 69870878) |
640 |
641 |
642 |
643 |
644 |
645 |
646 |
647 |
648 | ```python
649 | p1p2Table_new.groupby('TSS source').agg(len).iloc[:,[2]]
650 | ```
651 |
652 |
653 |
654 |
655 |
656 |
657 |
658 |
659 | | TSS source | primary TSS |
660 | |---|---|
661 | | CAGE, matched peaks | 8 |
671 |
672 |
673 |
674 |
675 |
676 |
677 |
678 |
679 | ```python
680 | len(p1p2Table_new)
681 | ```
682 |
683 |
684 |
685 |
686 | 8
687 |
688 |
689 |
690 |
691 | ```python
692 | #save tables
693 | tssTable_new.to_csv(TSS_TABLE_PATH, sep='\t')
694 | p1p2Table_new.to_csv(P1P2_TABLE_PATH, sep='\t')
695 | ```
696 |
697 | ## Find all sgRNAs in genomic regions of interest
698 |
699 |
700 | ```python
701 | libraryTable_new, sgInfoTable_new = findAllGuides(p1p2Table_new, genomeDict,
702 | (-25,500)) #region around P1P2 TSSs to search for new sgRNAs; recommend -550,-25 for CRISPRa
703 | ```
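The window tuple is interpreted relative to the direction of transcription. A minimal standalone sketch of the strand-aware arithmetic, mirroring the coordinate math at the top of findAllGuides in sgRNA_learning.py (the example uses the AADACL2 annotation from above):

```python
#convert an (upstream, downstream) window around a P1P2 annotation into a genomic range,
#mirroring the strand-aware arithmetic at the top of findAllGuides
def windowToGenomicRange(primaryTSS, secondaryTSS, strand, rangeTup):
    tssPositions = list(primaryTSS) + list(secondaryTSS)
    if strand == '+':
        return (min(tssPositions) + rangeTup[0], max(tssPositions) + rangeTup[1])
    else:
        return (min(tssPositions) - rangeTup[1], max(tssPositions) - rangeTup[0])

print windowToGenomicRange((151451707, 151451722), (151451707, 151451722), '+', (-25, 500))
#(151451682, 151452222)
```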
704 |
705 |
706 | ```python
707 | len(libraryTable_new)
708 | ```
709 |
710 |
711 |
712 |
713 | 1125
714 |
715 |
716 |
717 |
718 | ```python
719 | libraryTable_new.head()
720 | ```
721 |
722 |
723 |
724 |
725 |
726 |
727 |
728 |
729 | | sgId | gene | transcripts | sequence | genomic sequence |
730 | |---|---|---|---|---|
731 | | AADACL2_+_151451720.23-P1P2 | AADACL2 | P1P2 | GTAGACTTGGGAACTCTCTC | CCTGAGAGAGTTCCCAAGTCTAC |
732 | | AADACL2_+_151451732.23-P1P2 | AADACL2 | P1P2 | GGTAGAGCAATTGTAGACTT | CCCAAGTCTACAATTGCTCTACT |
733 | | AADACL2_+_151451733.23-P1P2 | AADACL2 | P1P2 | GAGTAGAGCAATTGTAGACT | CCAAGTCTACAATTGCTCTACTA |
734 | | AADACL2_-_151451809.23-P1P2 | AADACL2 | P1P2 | GCTCAGTACTGTGAAGAAGC | TCTCAGTACTGTGAAGAAGCTGG |
735 | | AADACL2_-_151451816.23-P1P2 | AADACL2 | P1P2 | GCTGTGAAGAAGCTGGAAAA | ACTGTGAAGAAGCTGGAAAAAGG |
778 |
779 |
780 |
781 |
782 |
783 |
784 |
785 | ## Predicting sgRNA activity
786 |
787 |
788 | ```python
789 | #calculate parameters for new sgRNAs
790 | paramTable_new = generateTypicalParamTable(libraryTable_new, sgInfoTable_new, tssTable_new, p1p2Table_new, genomeDict, bwhandleDict)
791 | ```
792 |
793 | .....Done!
794 |
795 |
796 | ```python
797 | paramTable_new.head()
798 | ```
799 |
800 |
801 |
802 |
803 |
804 |
805 |
806 |
807 | | sgId | length | distance: primary TSS-Up | distance: primary TSS-Down | distance: secondary TSS-Up | distance: secondary TSS-Down | homopolymers: A | homopolymers: G | homopolymers: C | homopolymers: T | base fractions: A | ... | RNA folding-pairing, no scaffold: -12 | -11 | -10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 |
808 | |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
809 | | AADACL2_+_151451720.23-P1P2 | 20 | 13 | -2 | 13 | -2 | 2 | 3 | 1 | 2 | 0.20 | ... | True | True | False | False | False | False | True | True | True | True |
810 | | AADACL2_+_151451732.23-P1P2 | 20 | 25 | 10 | 25 | 10 | 2 | 2 | 1 | 2 | 0.30 | ... | False | False | False | False | False | False | False | False | False | False |
811 | | AADACL2_+_151451733.23-P1P2 | 20 | 26 | 11 | 26 | 11 | 2 | 1 | 1 | 2 | 0.35 | ... | False | False | False | False | False | False | False | False | False | False |
812 | | AADACL2_-_151451809.23-P1P2 | 20 | 102 | 87 | 102 | 87 | 2 | 1 | 1 | 1 | 0.30 | ... | False | True | True | True | False | False | False | False | False | False |
813 | | AADACL2_-_151451816.23-P1P2 | 20 | 109 | 94 | 109 | 94 | 4 | 2 | 1 | 1 | 0.40 | ... | True | True | True | False | False | False | False | False | False | False |
814 | 
815 | 5 rows × 808 columns
988 |
989 |
990 |
991 |
992 |
993 | ```python
994 | #if starting from a separate session from where you ran the sgRNA learning steps of Part 1, reload the following
995 | import cPickle
996 | with open(PICKLE_FILE) as infile:
997 | fitTable, estimators, scaler, reg = cPickle.load(infile)
998 |
999 | transformedParams_train = pd.read_csv(TRANSFORMED_PARAM_HEADER, sep='\t', header=[0,1], index_col=0) #two header rows to restore the MultiIndex columns
1000 | ```
1001 |
1002 |
1003 | ```python
1004 | #transform and predict scores according to sgRNA prediction model
1005 | transformedParams_new = transformParams(paramTable_new, fitTable, estimators)
1006 |
1007 | #reconcile any differences in column headers generated by automated binning
1008 | colTups = []
1009 | for (l1, l2), col in transformedParams_new.iteritems():
1010 | colTups.append((l1,str(l2)))
1011 | transformedParams_new.columns = pd.MultiIndex.from_tuples(colTups)
1012 |
1013 | predictedScores_new = pd.Series(reg.predict(scaler.transform(transformedParams_new.loc[:, transformedParams_train.columns].fillna(0).values)), index=transformedParams_new.index)
1014 | ```
1015 |
1016 |
1017 | ```python
1018 | predictedScores_new.head()
1019 | ```
1020 |
1021 |
1022 |
1023 |
1024 | sgId
1025 | AADACL2_+_151451720.23-P1P2 0.641245
1026 | AADACL2_+_151451732.23-P1P2 0.693926
1027 | AADACL2_+_151451733.23-P1P2 0.655759
1028 | AADACL2_-_151451809.23-P1P2 0.500835
1029 | AADACL2_-_151451816.23-P1P2 0.434376
1030 | dtype: float64
1031 |
1032 |
1033 |
1034 |
1035 | ```python
1036 | libraryTable_new.to_csv(LIBRARY_TABLE_PATH,sep='\t')
1037 | sgInfoTable_new.to_csv(sgRNA_INFO_PATH,sep='\t')
1038 | predictedScores_new.to_csv(PREDICTED_SCORES_PATH, sep='\t')
1039 | ```
1040 |
1041 | # 3. Construct sgRNA libraries
1042 | ## Score sgRNAs for off-target potential
1043 |
1044 | ### There are many ways to score sgRNAs for off-target potential; below is one method that is simple and flexible, but it ignores gapped alignments and alternate PAMs, and relies on bowtie, which may not be maximally sensitive in all cases
1045 |
1046 |
1047 | ```python
1048 | !mkdir temp_bowtie_files
1049 | ```
1050 |
1051 |
1052 | ```python
1053 | #output all sequences to a temporary FASTQ file for running bowtie alignment
1054 | fqFile = 'temp_bowtie_files/bowtie_input.fq'
1055 |
1056 | def outputTempBowtieFastq(libraryTable, outputFileName):
1057 | phredString = 'I4!=======44444+++++++' #weighting for how impactful mismatches are along sgRNA sequence
1058 | with open(outputFileName,'w') as outfile:
1059 | for name, row in libraryTable.iterrows():
1060 | outfile.write('@' + name + '\n')
1061 | outfile.write('CCN' + str(Seq.Seq(row['sequence'][1:]).reverse_complement()) + '\n')
1062 | outfile.write('+\n')
1063 | outfile.write(phredString + '\n')
1064 |
1065 | outputTempBowtieFastq(libraryTable_new, fqFile)
1066 | ```
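For reference, the record this writes for the first AADACL2 sgRNA from the table above is shown below; the read is 'CCN' plus the reverse complement of all but the first base of the protospacer, so the quality string weights mismatches from PAM-proximal (most impactful) to PAM-distal:

    @AADACL2_+_151451720.23-P1P2
    CCNGAGAGAGTTCCCAAGTCTA
    +
    I4!=======44444+++++++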
1067 |
1068 |
1069 | ```python
1070 | import subprocess
1071 |
1072 | #specifying a list of parameters to run bowtie with
1073 | #each tuple contains
1074 | # *the mismatch threshold below which a site is considered a potential off-target (higher is more stringent)
1075 | # *the number of sites allowed (1 is minimum since each sgRNA should have one true site in genome)
1076 | # *the genome index against which to align the sgRNA sequences; these can be custom built to only consider sites near TSSs
1077 | # *a name for the bowtie run to create appropriately named output files
1078 | alignmentList = [(39,1,'large_data_files/hg19.ensemblTSSflank500b','39_nearTSS'),
1079 | (31,1,'large_data_files/hg19.ensemblTSSflank500b','31_nearTSS'),
1080 | (21,1,'large_data_files/hg19_maskChrMandPAR','21_genome'),
1081 | (31,2,'large_data_files/hg19.ensemblTSSflank500b','31_2_nearTSS'),
1082 | (31,3,'large_data_files/hg19.ensemblTSSflank500b','31_3_nearTSS')]
1083 |
1084 | alignmentColumns = []
1085 | for btThreshold, mflag, bowtieIndex, runname in alignmentList:
1086 |
1087 | alignedFile = 'temp_bowtie_files/' + runname + '_aligned.txt'
1088 | unalignedFile = 'temp_bowtie_files/' + runname + '_unaligned.fq'
1089 | maxFile = 'temp_bowtie_files/' + runname + '_max.fq'
1090 |
1091 | bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile
1092 | print bowtieString
1093 | print subprocess.call(bowtieString, shell=True) #0 means finished without errors
1094 |
1095 |     #parse through the file of sgRNAs that exceeded "m", the maximum allowable alignments, and mark as "True" any that are found
1096 | try:
1097 | with open(maxFile) as infile:
1098 | sgsAligning = set()
1099 | for i, line in enumerate(infile):
1100 | if i%4 == 0: #id line
1101 | sgsAligning.add(line.strip()[1:])
1102 | except IOError: #no sgRNAs exceeded m, so no maxFile created
1103 | sgsAligning = set()
1104 |
1105 | alignmentColumns.append(libraryTable_new.apply(lambda row: row.name in sgsAligning, axis=1))
1106 |
1107 | #collate results into a table, and flip the boolean values to yield the sgRNAs that passed filter as True
1108 | alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3]).ne(True)
1109 | ```
1110 |
1111 | bowtie -n 3 -l 15 -e 39 -m 1 --nomaqround -a --tryhard -p 16 --chunkmbs 256 large_data_files/hg19.ensemblTSSflank500b --suppress 5,6,7 --un temp_bowtie_files/39_nearTSS_unaligned.fq --max temp_bowtie_files/39_nearTSS_max.fq -q temp_bowtie_files/bowtie_input.fq temp_bowtie_files/39_nearTSS_aligned.txt
1112 | 0
1113 | bowtie -n 3 -l 15 -e 31 -m 1 --nomaqround -a --tryhard -p 16 --chunkmbs 256 large_data_files/hg19.ensemblTSSflank500b --suppress 5,6,7 --un temp_bowtie_files/31_nearTSS_unaligned.fq --max temp_bowtie_files/31_nearTSS_max.fq -q temp_bowtie_files/bowtie_input.fq temp_bowtie_files/31_nearTSS_aligned.txt
1114 | 0
1115 | bowtie -n 3 -l 15 -e 21 -m 1 --nomaqround -a --tryhard -p 16 --chunkmbs 256 large_data_files/hg19_maskChrMandPAR --suppress 5,6,7 --un temp_bowtie_files/21_genome_unaligned.fq --max temp_bowtie_files/21_genome_max.fq -q temp_bowtie_files/bowtie_input.fq temp_bowtie_files/21_genome_aligned.txt
1116 | 0
1117 | bowtie -n 3 -l 15 -e 31 -m 2 --nomaqround -a --tryhard -p 16 --chunkmbs 256 large_data_files/hg19.ensemblTSSflank500b --suppress 5,6,7 --un temp_bowtie_files/31_2_nearTSS_unaligned.fq --max temp_bowtie_files/31_2_nearTSS_max.fq -q temp_bowtie_files/bowtie_input.fq temp_bowtie_files/31_2_nearTSS_aligned.txt
1118 | 0
1119 | bowtie -n 3 -l 15 -e 31 -m 3 --nomaqround -a --tryhard -p 16 --chunkmbs 256 large_data_files/hg19.ensemblTSSflank500b --suppress 5,6,7 --un temp_bowtie_files/31_3_nearTSS_unaligned.fq --max temp_bowtie_files/31_3_nearTSS_max.fq -q temp_bowtie_files/bowtie_input.fq temp_bowtie_files/31_3_nearTSS_aligned.txt
1120 | 0
1121 |
1122 |
1123 |
1124 | ```python
1125 | alignmentTable.head() #True = passed threshold
1126 | ```
1127 |
1128 |
1129 |
1130 |
1131 |
1132 |
1133 |
1134 |
1135 | | sgId | 39_nearTSS | 31_nearTSS | 21_genome | 31_2_nearTSS | 31_3_nearTSS |
1136 | |---|---|---|---|---|---|
1137 | | AADACL2_+_151451720.23-P1P2 | True | True | False | True | True |
1138 | | AADACL2_+_151451732.23-P1P2 | True | True | True | True | True |
1139 | | AADACL2_+_151451733.23-P1P2 | True | True | True | True | True |
1140 | | AADACL2_-_151451809.23-P1P2 | False | True | False | True | True |
1141 | | AADACL2_-_151451816.23-P1P2 | True | True | False | True | True |
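To summarize pass rates, a quick sketch using the boolean table above:

```python
#number of sgRNAs passing each individual filter, and passing all filters at once
print alignmentTable.sum()
print alignmentTable.all(axis=1).sum()
```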
1191 |
1192 |
1193 |
1194 |
1195 |
1196 |
1197 |
1198 | ## Pick the top sgRNAs for a library, given predicted activity scores and off-target filtering
1199 |
1200 |
1201 | ```python
1202 | #combine all generated data into one master table
1203 | predictedScores_new.name = 'predicted score'
1204 | v2Table = pd.concat((libraryTable_new, predictedScores_new, alignmentTable, sgInfoTable_new), axis=1, keys=['library table v2', 'predicted score', 'off-target filters', 'sgRNA info'])
1205 | ```
1206 |
1207 |
1208 | ```python
1209 | import re
1210 | #for our pCRISPRi/a-v2 vector, we append flanking sequences to each sgRNA sequence for cloning and require the oligo to contain
1211 | #exactly 1 BstXI and BlpI site each for cloning, and exactly 0 SbfI sites for sequencing sample preparation
1212 | restrictionSites = {re.compile('CCA......TGG'):1, #BstXI
1213 |                     re.compile('GCT.AGC'):1,      #BlpI
1214 |                     re.compile('CCTGCAGG'):0}     #SbfI
1215 |
1216 | def matchREsites(sequence, REdict):
1217 | seq = sequence.upper()
1218 |     for resite, numMatchesExpected in REdict.iteritems():
1219 | if len(resite.findall(seq)) != numMatchesExpected:
1220 | return False
1221 |
1222 | return True
1223 |
1224 | def checkOverlaps(leftPosition, acceptedLeftPositions, nonoverlapMin):
1225 | for pos in acceptedLeftPositions:
1226 | if abs(pos - leftPosition) < nonoverlapMin:
1227 | return False
1228 | return True
1229 | ```
1230 |
1231 |
1232 | ```python
1233 | #flanking sequences
1234 | upstreamConstant = 'CCACCTTGTTG'
1235 | downstreamConstant = 'GTTTAAGAGCTAAGCTG'
1236 |
1237 | #minimum spacing (bp) between the left-most positions of any two sgRNAs accepted for the same TSS
1238 | nonoverlapMin = 3
1239 |
1240 | #number of sgRNAs to pick per gene/TSS
1241 | sgRNAsToPick = 10
1242 |
1243 | #list of off-target filter (or combinations of filters) levels, matching the names in the alignment table above
1244 | offTargetLevels = [['31_nearTSS', '21_genome'],
1245 | ['31_nearTSS'],
1246 | ['21_genome'],
1247 | ['31_2_nearTSS'],
1248 | ['31_3_nearTSS']]
1249 |
1250 | #for each gene/TSS, go through each sgRNA in descending order of predicted score
1251 | #if an sgRNA passes the restriction site, overlap, and off-target filters, accept it into the library
1252 | #if the number of sgRNAs accepted is less than sgRNAsToPick, reduce off-target stringency by one and continue
1253 | v2Groups = v2Table.groupby([('library table v2','gene'),('library table v2','transcripts')])
1254 | newSgIds = []
1255 | unfinishedTss = []
1256 | for (gene, transcript), group in v2Groups:
1257 | geneSgIds = []
1258 | geneLeftPositions = []
1259 | empiricalSgIds = dict()
1260 |
1261 | stringency = 0
1262 |
1263 | while len(geneSgIds) < sgRNAsToPick and stringency < len(offTargetLevels):
1264 | for sgId_v2, row in group.sort_values(('predicted score','predicted score'), ascending=False).iterrows():
1265 | oligoSeq = upstreamConstant + row[('library table v2','sequence')] + downstreamConstant
1266 | leftPos = row[('sgRNA info', 'position')] - (23 if row[('sgRNA info', 'strand')] == '-' else 0)
1267 | if len(geneSgIds) < sgRNAsToPick and row['off-target filters'].loc[offTargetLevels[stringency]].all() \
1268 | and matchREsites(oligoSeq, restrictionSites) \
1269 | and checkOverlaps(leftPos, geneLeftPositions, nonoverlapMin):
1270 | geneSgIds.append((sgId_v2,
1271 | gene,transcript,
1272 | row[('library table v2','sequence')], oligoSeq,
1273 | row[('predicted score','predicted score')], np.nan,
1274 | stringency))
1275 | geneLeftPositions.append(leftPos)
1276 |
1277 | stringency += 1
1278 |
1279 | if len(geneSgIds) < sgRNAsToPick:
1280 | unfinishedTss.append((gene, transcript)) #if the number of accepted sgRNAs is still less than sgRNAsToPick, discard gene
1281 | else:
1282 | newSgIds.extend(geneSgIds)
1283 |
1284 | libraryTable_complete = pd.DataFrame(newSgIds, columns = ['sgID', 'gene', 'transcript','protospacer sequence', 'oligo sequence',
1285 | 'predicted score', 'empirical score', 'off-target stringency']).set_index('sgID')
1286 | ```
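A few sanity checks on the filters (the protospacer is the top AADACL2 sgRNA from the library tables; the positions passed to checkOverlaps are illustrative):

```python
#assemble and check an example oligo
exampleOligo = upstreamConstant + 'GGTAGAGCAATTGTAGACTT' + downstreamConstant
print exampleOligo                                 #CCACCTTGTTGGGTAGAGCAATTGTAGACTTGTTTAAGAGCTAAGCTG
print matchREsites(exampleOligo, restrictionSites) #True: 1 BstXI, 1 BlpI, 0 SbfI sites
print checkOverlaps(151451709, [151451697], nonoverlapMin) #True: left positions 12bp apart
```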
1287 |
1288 |
1289 | ```python
1290 | print len(libraryTable_complete)
1291 | ```
1292 |
1293 | 80
1294 |
1295 |
1296 |
1297 | ```python
1298 | #number of sgRNAs accepted at each stringency level
1299 | libraryTable_complete.groupby('off-target stringency').agg(len).iloc[:,0]
1300 | ```
1301 |
1302 |
1303 |
1304 |
1305 | off-target stringency
1306 | 0 80
1307 | Name: gene, dtype: int64
1308 |
1309 |
1310 |
1311 |
1312 | ```python
1313 | #number of TSSs with fewer than required number of sgRNAs (and thus not included in the library)
1314 | print len(unfinishedTss)
1315 | ```
1316 |
1317 | 0
1318 |
1319 |
1320 |
1321 | ```python
1322 | libraryTable_complete.head()
1323 | ```
1324 |
1325 |
1326 |
1327 |
1328 |
1329 |
1330 |
1331 |
1332 | | sgID | gene | transcript | protospacer sequence | oligo sequence | predicted score | empirical score | off-target stringency |
1333 | |---|---|---|---|---|---|---|---|
1334 | | AADACL2_+_151451732.23-P1P2 | AADACL2 | P1P2 | GGTAGAGCAATTGTAGACTT | CCACCTTGTTGGGTAGAGCAATTGTAGACTTGTTTAAGAGCTAAGCTG | 0.693926 | NaN | 0 |
1335 | | AADACL2_-_151452019.23-P1P2 | AADACL2 | P1P2 | GATGACTTATTGACTAAAAA | CCACCTTGTTGGATGACTTATTGACTAAAAAGTTTAAGAGCTAAGCTG | 0.451392 | NaN | 0 |
1336 | | AADACL2_+_151452121.23-P1P2 | AADACL2 | P1P2 | GACTGTTACTCACAGATATA | CCACCTTGTTGGACTGTTACTCACAGATATAGTTTAAGAGCTAAGCTG | 0.426695 | NaN | 0 |
1337 | | AADACL2_-_151451828.23-P1P2 | AADACL2 | P1P2 | GTGGAAAAAGGGATATTATG | CCACCTTGTTGGTGGAAAAAGGGATATTATGGTTTAAGAGCTAAGCTG | 0.404655 | NaN | 0 |
1338 | | AADACL2_-_151451931.23-P1P2 | AADACL2 | P1P2 | GAGCTGGAAAATAATGGCCT | CCACCTTGTTGGAGCTGGAAAATAATGGCCTGTTTAAGAGCTAAGCTG | 0.404269 | NaN | 0 |
1402 |
1403 |
1404 |
1405 |
1406 |
1407 |
1408 |
1409 | ## Design negative controls matching the base composition of the library
1410 |
1411 |
1412 | ```python
1413 | #calculate the base frequency at each position of the sgRNA, then generate random sequences weighted by this frequency
1414 | def getBaseFrequencies(libraryTable, baseConversion = {'G':0, 'C':1, 'T':2, 'A':3}):
1415 | baseArray = np.zeros((len(libraryTable),20))
1416 |
1417 | for i, (index, seq) in enumerate(libraryTable['protospacer sequence'].iteritems()):
1418 | for j, char in enumerate(seq.upper()):
1419 | baseArray[i,j] = baseConversion[char]
1420 |
1421 | baseTable = pd.DataFrame(baseArray, index = libraryTable.index)
1422 |
1423 | baseFrequencies = baseTable.apply(lambda col: col.groupby(col).agg(len)).fillna(0) / len(baseTable)
1424 | baseFrequencies.index = ['G','C','T','A']
1425 |
1426 | baseCumulativeFrequencies = baseFrequencies.copy()
1427 | baseCumulativeFrequencies.loc['C'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C']
1428 | baseCumulativeFrequencies.loc['T'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T']
1429 | baseCumulativeFrequencies.loc['A'] = baseFrequencies.loc['G'] + baseFrequencies.loc['C'] + baseFrequencies.loc['T'] + baseFrequencies.loc['A']
1430 |
1431 | return baseFrequencies, baseCumulativeFrequencies
1432 |
1433 | def generateRandomSequence(baseCumulativeFrequencies):
1434 | randArray = np.random.random(baseCumulativeFrequencies.shape[1])
1435 |
1436 | seq = []
1437 | for i, col in baseCumulativeFrequencies.iteritems():
1438 | for base, freq in col.iteritems():
1439 | if randArray[i] < freq:
1440 | seq.append(base)
1441 | break
1442 |
1443 | return ''.join(seq)
1444 | ```
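As a quick illustration of the sampling (output varies from run to run):

```python
#draw a single random 20nt protospacer weighted by the library's positional base frequencies
print generateRandomSequence(getBaseFrequencies(libraryTable_complete)[1])
```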
1445 |
1446 |
1447 | ```python
1448 | baseCumulativeFrequencies = getBaseFrequencies(libraryTable_complete)[1]
1449 | negList = []
1450 | numberToGenerate = 1000 #can generate many more; some will be filtered out by off-targets, and you can always select an arbitrary subset for inclusion into the library
1451 | for i in range(numberToGenerate):
1452 | negList.append(generateRandomSequence(baseCumulativeFrequencies))
1453 | negTable = pd.DataFrame(negList, index=['non-targeting_' + str(i) for i in range(numberToGenerate)], columns = ['sequence'])
1454 |
1455 | fqFile = 'temp_bowtie_files/bowtie_input_negs.fq'
1456 | outputTempBowtieFastq(negTable, fqFile)
1457 | ```
1458 |
1459 |
1460 | ```python
1461 | #similar to targeting sgRNA off-target scoring, but looking for sgRNAs with 0 alignments
1462 | alignmentList = [(31,1,'~/indices/hg19.ensemblTSSflank500b','31_nearTSS_negs'),
1463 | (21,1,'~/indices/hg19_maskChrMandPAR','21_genome_negs')]
1464 |
1465 | alignmentColumns = []
1466 | for btThreshold, mflag, bowtieIndex, runname in alignmentList:
1467 |
1468 | alignedFile = 'temp_bowtie_files/' + runname + '_aligned.txt'
1469 | unalignedFile = 'temp_bowtie_files/' + runname + '_unaligned.fq'
1470 | maxFile = 'temp_bowtie_files/' + runname + '_max.fq'
1471 |
1472 | bowtieString = 'bowtie -n 3 -l 15 -e '+str(btThreshold)+' -m ' + str(mflag) + ' --nomaqround -a --tryhard -p 16 --chunkmbs 256 ' + bowtieIndex + ' --suppress 5,6,7 --un ' + unalignedFile + ' --max ' + maxFile + ' '+ ' -q '+fqFile+' '+ alignedFile
1473 | print bowtieString
1474 | print subprocess.call(bowtieString, shell=True)
1475 |
1476 |     #for negative controls, parse the unaligned file (sgRNAs with zero alignments), so the booleans are not flipped here
1477 | with open(unalignedFile) as infile:
1478 | sgsAligning = set()
1479 | for i, line in enumerate(infile):
1480 | if i%4 == 0: #id line
1481 | sgsAligning.add(line.strip()[1:])
1482 |
1483 | alignmentColumns.append(negTable.apply(lambda row: row.name in sgsAligning, axis=1))
1484 |
1485 | alignmentTable = pd.concat(alignmentColumns,axis=1, keys=zip(*alignmentList)[3])
1486 | alignmentTable.head()
1487 | ```
1488 |
1489 | bowtie -n 3 -l 15 -e 31 -m 1 --nomaqround -a --tryhard -p 16 --chunkmbs 256 ~/indices/hg19.ensemblTSSflank500b --suppress 5,6,7 --un temp_bowtie_files/31_nearTSS_negs_unaligned.fq --max temp_bowtie_files/31_nearTSS_negs_max.fq -q temp_bowtie_files/bowtie_input_negs.fq temp_bowtie_files/31_nearTSS_negs_aligned.txt
1490 | 0
1491 | bowtie -n 3 -l 15 -e 21 -m 1 --nomaqround -a --tryhard -p 16 --chunkmbs 256 ~/indices/hg19_maskChrMandPAR --suppress 5,6,7 --un temp_bowtie_files/21_genome_negs_unaligned.fq --max temp_bowtie_files/21_genome_negs_max.fq -q temp_bowtie_files/bowtie_input_negs.fq temp_bowtie_files/21_genome_negs_aligned.txt
1492 | 0
1493 |
1494 |
1495 |
1496 |
1497 |
1498 |
1499 |
1500 |
1501 |
1502 | |  | 31_nearTSS_negs | 21_genome_negs |
1503 | |---|---|---|
1504 | | non-targeting_0 | True | True |
1505 | | non-targeting_1 | True | True |
1506 | | non-targeting_2 | True | True |
1507 | | non-targeting_3 | True | False |
1508 | | non-targeting_4 | True | True |
1532 |
1533 |
1534 |
1535 |
1536 |
1537 |
1538 |
1539 |
1540 | ```python
1541 | acceptedNegList = []
1542 | negCount = 0
1543 | for i, (name, row) in enumerate(pd.concat((negTable,alignmentTable),axis=1, keys=['seq','alignment']).iterrows()):
1544 | oligo = upstreamConstant + row['seq','sequence'] + downstreamConstant
1545 | if row['alignment'].all() and matchREsites(oligo, restrictionSites):
1546 | acceptedNegList.append(('non-targeting_%05d' % negCount, 'negative_control', 'na', row['seq','sequence'], oligo, 0))
1547 | negCount += 1
1548 |
1549 | acceptedNegs = pd.DataFrame(acceptedNegList, columns = ['sgId', 'gene', 'transcript', 'protospacer sequence', 'oligo sequence', 'off-target stringency']).set_index('sgId')
1550 | ```
1551 |
1552 |
1553 | ```python
1554 | acceptedNegs.head()
1555 | ```
1556 |
1557 |
1558 |
1559 |
1560 |
1561 |
1562 |
1563 |
1564 | | sgId | gene | transcript | protospacer sequence | oligo sequence | off-target stringency |
1565 | |---|---|---|---|---|---|
1566 | | non-targeting_00000 | negative_control | na | GGTTTCGCGCGCTTACAGAT | CCACCTTGTTGGGTTTCGCGCGCTTACAGATGTTTAAGAGCTAAGCTG | 0 |
1567 | | non-targeting_00001 | negative_control | na | GGTGGTCGAAGATAGCGAGC | CCACCTTGTTGGGTGGTCGAAGATAGCGAGCGTTTAAGAGCTAAGCTG | 0 |
1568 | | non-targeting_00002 | negative_control | na | GCTCTTGACAAATTCAAGCT | CCACCTTGTTGGCTCTTGACAAATTCAAGCTGTTTAAGAGCTAAGCTG | 0 |
1569 | | non-targeting_00003 | negative_control | na | GGCCGGGAGAGCGGGAACTC | CCACCTTGTTGGGCCGGGAGAGCGGGAACTCGTTTAAGAGCTAAGCTG | 0 |
1570 | | non-targeting_00004 | negative_control | na | GTCGCAAGCCGGGGTAGGGT | CCACCTTGTTGGTCGCAAGCCGGGGTAGGGTGTTTAAGAGCTAAGCTG | 0 |
1620 |
1621 |
1622 |
1623 |
1624 |
1625 |
1626 |
1627 |
1628 | ```python
1629 | libraryTable_complete.to_csv(LIBRARY_WITHOUT_NEGATIVES_PATH, sep='\t')
1630 | acceptedNegs.to_csv(NEGATIVE_CONTROLS_PATH,sep='\t')
1631 | ```
1632 |
1633 | ## Finalizing library design
1634 |
1635 | * divide genes into sublibrary groups (if required)
1636 | * assign negative control sgRNAs to sublibrary groups; ~1-2% of the number of sgRNAs in the library is a good rule-of-thumb
1637 | * append PCR adapter sequences (~18bp) to each end of the oligo sequences to enable amplification of the oligo pool; each sublibrary should have an orthogonal sequence so they can be cloned separately (a minimal sketch follows below)
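A minimal sketch of the adapter-appending step; the adapter sequences and sublibrary assignment below are hypothetical placeholders for illustration, not the published v2 adapters:

```python
#append sublibrary-specific PCR adapter sequences to each oligo
#the ~18bp adapters below are hypothetical placeholders, not the published v2 set
pcrAdapters = {'sublibrary_1': ('GTGTAACCCGTAGGGCAC', 'CCTTAGGTGTGCAGCGTC')}

fivePrime, threePrime = pcrAdapters['sublibrary_1']
finalOligos = libraryTable_complete['oligo sequence'].apply(lambda oligo: fivePrime + oligo + threePrime)
```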
1638 |
1639 |
1640 | ```python
1641 |
1642 | ```
1643 |
--------------------------------------------------------------------------------