├── Correcting_Spreading_of_Signal_Notebook.ipynb
├── LICENSE
├── README.md
├── data
│   ├── info.txt
│   ├── leo1_anon.csv
│   ├── leo2_anon.csv
│   ├── leo3_anon.csv
│   └── leo4_anon.csv
├── mHSC_plate1HiSeq_counts_IndexInfo_anon.csv
├── mHSC_plate1NextSeq_counts_IndexInfo_reindex_anon.csv
├── sandbergCorrection_analyzeClustering.ipynb
└── unspread.py
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 Anton Larsson
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Computational correction of index switching in multiplexed sequencing libraries
2 |
3 | Supplementary information to the article "Computational correction of index switching in multiplexed sequencing libraries" by Anton JM Larsson, Geoff Stanley, Rahul Sinha, Irving L. Weissman and Rickard Sandberg, available in Nature Methods (http://dx.doi.org/10.1038/nmeth.4666)
4 |
5 | The Jupyter notebook (Correcting_Spreading_of_Signal_Notebook.ipynb) contains Python code for the analysis and correction of index-swapping, including the generation of Figures 1, 2A-C, S1 and S2 in the manuscript (written by Anton JM Larsson). The notebook (sandbergCorrection_analyzeClustering.ipynb) contains the R code to reproduce Figures 2D-E and S3 of the manuscript (written by Geoff Stanley).
6 |
7 | ## _unspread.py_
8 |
9 | _unspread.py_ estimates the percentage of contaminating reads in the experiment, estimates the 'rate of spreading', and corrects the read counts if the experiment is affected to a sufficient degree. The _unspread.py_ script requires a table of read counts supplied as a .csv file with added information regarding each cell's index barcodes.
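To judge whether leaked reads follow the index barcodes, the script takes genes highly expressed in a single well and tests whether the other wells with counts concentrate along that well's row and column, using a hypergeometric survival function. A minimal sketch of that test with made-up well counts (the values of `N` and `k` below are illustrative, not from the data):

```python
from scipy.stats import hypergeom

n_rows, n_cols = 16, 24        # default 384-well plate layout
M = n_rows * n_cols - 1        # wells other than the source well
K = n_rows + n_cols - 2        # wells sharing the source well's row or column
N = 12                         # hypothetical: other wells with nonzero counts
k = 10                         # hypothetical: how many of those share the row/column

# Probability of drawing >= k row/column wells by chance alone;
# a tiny p-value means the leaked reads track the source well's barcodes
p = hypergeom.sf(k - 1, M, K, N)
print(p < 0.05)  # True: this much row/column overlap is not chance
```

Across all such genes, the p-values are corrected for multiple testing before the spreading rate is estimated.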
10 |
11 | ### System Requirements
12 |
13 | _unspread.py_ is a Python 3 script with the following dependencies:
14 |
15 | ```
16 | pandas: 0.19.2
17 | numpy: 1.9.0
18 | matplotlib: 2.0
19 | statsmodels: 0.6.1
20 | scipy: 1.0.0
21 | patsy: 0.4.1
22 | ```
23 | No further installation is needed.
24 |
25 | ### Usage
26 |
27 |     usage: unspread.py [-h] [--i5 STRING] [--i7 STRING] [--rows INTEGER]
28 |                        [--cols INTEGER] [--idx_col INTEGER] [--sep CHAR]
29 |                        [--h INTEGER] [--c INTEGER] [--t FLOAT]
30 |                        [--idx_in_id BOOLEAN] [--delim_idx CHAR] [--column BOOLEAN]
31 |                        filename
32 |
33 | Unspread: Computational correction of barcode index spreading
34 |
35 | **positional arguments:**
36 |
37 | filename .csv file with counts
38 |
39 | **optional arguments:**
40 |
41 | -h, --help show this help message and exit
42 |
43 | --i5 STRING Index name of i5 barcodes (default: 'i5.index.name')
44 |
45 | --i7 STRING Index name of i7 barcodes (default: 'i7.index.name')
46 |
47 | --rows INTEGER Number of rows in plate (default: 16)
48 |
49 | --cols INTEGER Number of columns in plate (default: 24)
50 |
51 | --idx_col INTEGER Which column serves as the index (default: 0)
52 |
53 | --sep CHAR The separator in the .csv file (default ',')
54 |
55 | --h INTEGER The read count above which a gene is considered highly
56 | expressed in only one cell (default: 30)
57 |
58 | --c INTEGER Cutoff to remove additional false positives (default: 5)
59 |
60 | --t FLOAT Threshold for acceptable fraction of spread counts
61 | (default: 0.05)
62 |
63 | --idx_in_id BOOLEAN If the index is in the cell id (i.e. cellid_i5_i7)
64 | (Default: 0 (False), set to 1 otherwise (True))
65 |
66 | --delim_idx CHAR If the index is in the cell id, the delimiting
67 | character (Default: '_')
68 |
69 | --column BOOLEAN If each column represents a cell, otherwise each
70 | row (default: 1 (True), set to 0 otherwise (False))
71 |
72 | ### Output
73 |
74 | _unspread.py_ outputs a set of figures with diagnostic information comparable to the figures in the article. A log file is also saved. If the plate is affected, a corrected .csv file is also written.
75 |
76 | ### Example
77 |
78 | An example from the first plate in the manuscript:
79 |
80 | |cell.name | N.index.name | S.index.name | 0610005C13Rik | 0610007C21Rik | ...|
81 | | --- | --- | --- | --- | --- | --- |
82 | |HSC02_a_p1c7r2_P01 | N701 | S522 | 0 | 117 | ...|
83 | |HSC02_a_p1c5r5_P03 | N702 | S522 | 0 | 5 | ...|
84 |
85 | In this particular example, genes are arranged in columns and cells in rows, but the converse is also supported.
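For illustration, the two example rows above can be loaded the same way the script loads a plate (pandas `read_csv` with `index_col=0` and the separator passed via `--sep`); the inline table here is just the README snippet, not a real data file:

```python
from io import StringIO
import pandas as pd

# The example rows from the table above, space-separated as in the plate-1 file
raw = """cell.name N.index.name S.index.name 0610005C13Rik 0610007C21Rik
HSC02_a_p1c7r2_P01 N701 S522 0 117
HSC02_a_p1c5r5_P03 N702 S522 0 5
"""

# --idx_col 0 makes the cell names the index; --sep ' ' corresponds to sep=' '
df = pd.read_csv(StringIO(raw), index_col=0, sep=' ')

# Cells are rows here, hence --column 0; with cells in columns the script
# would transpose the frame (df = df.T) before analysing it
print(df.shape)  # (2, 4): two cells, two index columns plus two genes
```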
86 |
87 | To run the correction of the first plate in the manuscript:
88 | ```
89 | ./unspread.py mHSC_plate1HiSeq_counts_IndexInfo_anon.csv --i5 'S.index.name' --i7 'N.index.name' --column 0 --sep ' '
90 | ```
91 | This command should not take longer than a minute.
92 |
93 | The expected command line output is:
94 | ```
95 | Reading file: mHSC_plate1HiSeq_counts_IndexInfo_anon.csv
96 |
97 | Estimating spreading from mHSC_plate1HiSeq_counts_IndexInfo_anon.csv
98 |
99 | Found expression to be biased along a certain column and row combination 753 times out of 899
100 |
101 | Estimated the median rate of spreading to be 0.0098
102 |
103 | Estimated fraction of spread reads to be 0.14827 and variance explained R-squared = 0.8996
104 |
105 | Saving figure from analysis to mHSC_plate1HiSeq_counts_IndexInfo_anon_figures.pdf
106 |
107 | Saving log file from analysis to mHSC_plate1HiSeq_counts_IndexInfo_anon_unspread.log
108 |
109 | Correcting spreading for each gene
110 |
111 | Saving correction to mHSC_plate1HiSeq_counts_IndexInfo_anon_corrected.csv
112 | ```
113 |
114 | The genes in the manuscript, _Mki67_ and _Tacr_, have IDs 7963 and 12319, respectively.
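The correction itself solves a Sylvester equation: with spreading rate r, the observed plate O for a gene is modelled as A·X + X·B, where A and B hold r off the diagonal and (0.5 − r/2) on it, and the true counts X are recovered with `scipy.linalg.solve_sylvester`, as in the script's `adjust_reads`. A toy example with one highly expressed well (the plate size and rate here are illustrative):

```python
import numpy as np
from scipy.linalg import solve_sylvester

n_rows, n_cols, rate = 4, 6, 0.01   # illustrative plate and spreading rate

# Spreading matrices as built in unspread.py: rate off the diagonal,
# (0.5 - rate/2) on it, so the diagonals of A and B together keep 1 - rate
A = np.full((n_rows, n_rows), rate); np.fill_diagonal(A, 0.5 - rate / 2)
B = np.full((n_cols, n_cols), rate); np.fill_diagonal(B, 0.5 - rate / 2)

# True counts: one gene highly expressed in a single well
X = np.zeros((n_rows, n_cols)); X[1, 2] = 1000.0

observed = A @ X + X @ B             # forward model: reads leak along the row/column
recovered = np.rint(solve_sylvester(A, B, observed))

print(recovered[1, 2])  # 1000.0: the source well is restored
```

In the real script the recovered counts are additionally thresholded (`--c`) to suppress residual false positives.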
115 |
--------------------------------------------------------------------------------
/data/info.txt:
--------------------------------------------------------------------------------
1 | These are the read count tables used in the paper, with anonymised gene lists.
2 |
--------------------------------------------------------------------------------
/sandbergCorrection_analyzeClustering.ipynb:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/unspread.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import os, sys, re
3 | import argparse
4 | import pandas as pd
5 | import numpy as np
6 | from matplotlib import use; use('Agg')  # select a non-interactive backend before importing pyplot
7 | import matplotlib.pyplot as plt
8 | from statsmodels.sandbox.stats.multicomp import multipletests
9 | import statsmodels.api as sm
10 | from scipy.stats import hypergeom
11 | from scipy.linalg import solve_sylvester
12 | from scipy.stats import binom_test
13 | from patsy import dmatrices
14 |
15 | # Used for getting genes to estimate spreading
16 | def num_counts(col, high):
17 |     return np.sum(col > high)
18 |
19 |
20 | parser = argparse.ArgumentParser(description='Unspread: Computational correction of barcode index spreading')
21 | parser.add_argument('filename', metavar='filename', type=str, nargs=1,help='.csv file with counts' )
22 | parser.add_argument('--i5', metavar='STRING', type=str, nargs=1, default=['i5.index.name'], help='Index name of i5 barcodes (default: \'i5.index.name\')')
23 | parser.add_argument('--i7', metavar='STRING', type=str, nargs=1, default=['i7.index.name'], help='Index name of i7 barcodes (default: \'i7.index.name\')')
24 | parser.add_argument('--rows', metavar='INTEGER', default=[16], type=int, nargs=1, help='Number of rows in plate (default: 16)')
25 | parser.add_argument('--cols', metavar='INTEGER', default=[24], type=int, nargs=1, help='Number of columns in plate (default: 24)')
26 | parser.add_argument('--idx_col', metavar='INTEGER', default=[0], type=int, nargs=1, help='Which column serves as the index (default: 0)')
27 | parser.add_argument('--sep', metavar='CHAR', default=[','], type=str, nargs=1, help='The separator in the .csv file (default \',\')')
28 | parser.add_argument('--h', metavar='INTEGER', default=[30], type=int, nargs=1, help='The read count above which a gene is considered highly expressed in only one cell (default: 30)')
29 | parser.add_argument('--c', metavar='INTEGER', default=[5], type=int, nargs=1, help='Cutoff to remove additional false positives (default: 5)')
30 | parser.add_argument('--t', metavar='FLOAT', default=[0.05], type=float, nargs=1, help='Threshold for acceptable fraction of spread counts (default: 0.05)')
31 | parser.add_argument('--idx_in_id', metavar='BOOLEAN', default=[0], type=int, nargs=1, help='If the index is in the cell id (i.e. cellid_i5_i7) (Default: 0 (False), set to 1 otherwise (True))')
32 | parser.add_argument('--delim_idx', metavar='CHAR', default=['_'], type=str, nargs=1, help='If the index is in the cell id, the delimiting character (Default: \'_\')')
33 | parser.add_argument('--column', metavar='BOOLEAN', default=[1], type=int, nargs=1, help='If each column represents a cell, otherwise each row (default: 1 (True), set to 0 otherwise (False))')
34 | parser.add_argument('--rate', metavar='FLOAT', default=[0.0], type=float, nargs=1, help='Set spreading rate manually (overrides any estimated rate).')
35 |
36 | args = parser.parse_args()
37 | filename = args.filename[0]
38 | i5_index_name = args.i5[0]
39 | i7_index_name = args.i7[0]
40 | separator = args.sep[0]
41 | n_rows = args.rows[0]
42 | n_cols = args.cols[0]
43 | idx = args.idx_col[0]
44 | high = args.h[0]
45 | c = args.c[0]
46 | threshold = args.t[0]
47 | column = args.column[0]
48 | idx_in_id = args.idx_in_id[0]
49 | delim_idx = args.delim_idx[0]
50 | r = args.rate[0]
51 |
52 | print('Reading file: {}'.format(filename))
53 |
54 | # Load in count table
55 | df = pd.read_csv(filename, index_col=idx, sep=separator)
56 | # Transform the data frame if each column represents a cell
57 | if np.bool_(column):
58 |     df = df.T
59 | # Extract the index names from the end of the cell id strings
60 | if np.bool_(idx_in_id):
61 |     i5_index_list = []
62 |     i7_index_list = []
63 |     for i, string in enumerate(df.index.values):
64 |         sp = string.split(delim_idx)
65 |         i5_index_list.append(sp[-2])
66 |         i7_index_list.append(sp[-1])
67 |     df[i5_index_name] = i5_index_list
68 |     df[i7_index_name] = i7_index_list
69 | df = df.sort_values(by=[i5_index_name,i7_index_name], ascending=[False, True])
70 | df = df.loc[:,~df.columns.duplicated()]
71 | df_noindex = df.drop([i5_index_name,i7_index_name], axis=1).astype(int)
72 |
73 | # Check if the shape makes sense
74 | if (df_noindex.shape[0] != n_rows*n_cols):
75 |     print('Number of cells in count file not the same as specified, exiting...')
76 |     quit()
77 |
78 | base = os.path.splitext(os.path.basename(filename))[0]
79 |
80 | print('Estimating spreading from {}'.format(filename))
81 |
82 | # Get genes to estimate the rate of spreading and fraction of contaminating reads.
83 | names_list = []
84 | for i, col in df_noindex.items():
85 |     if num_counts(col, high) == 1:
86 |         names_list.append(i)
87 |
88 | n_names = len(names_list)
89 | if n_names == 0:
90 |     print('Found no genes usable to estimate spreading, exiting...')
91 |     quit()
92 | # Count the number of true and spread counts respectively
93 | true_counts = np.zeros(n_names)
94 | spread_counts = np.zeros((n_names, n_rows+n_cols - 2))
95 | num_wells_counts = np.zeros(n_names)
96 |
97 | for i, namn in enumerate(names_list):
98 |     if len(df_noindex[namn].values) == n_rows*n_cols:
99 |         true_counts[i] = np.amax(df_noindex[namn].values)
100 |         w = np.argwhere(df_noindex[namn].values.reshape(n_rows,n_cols) == true_counts[i])[0]
101 |         spread_counts[i] = np.append(np.append(df_noindex[namn].values.reshape(n_rows,n_cols)[w[0], :w[1]], df_noindex[namn].values.reshape(n_rows,n_cols)[w[0], w[1]+1:]), np.append(df_noindex[namn].values.reshape(n_rows,n_cols)[:w[0], w[1]], df_noindex[namn].values.reshape(n_rows,n_cols)[w[0]+1:, w[1]]))
102 |         num_wells_counts[i] = np.sum(df_noindex[namn].values != 0) - 1
103 |
104 |
105 | prop_spread = np.zeros((len(names_list), n_rows+n_cols - 2))
106 |
107 | for i in range(len(names_list)):
108 |     prop_spread[i] = spread_counts[i]/true_counts[i]
109 |
110 | test_bias = np.zeros(n_names)
111 |
112 | # Test whether the spread counts are biased along the column and row of the source well.
113 | for i in range(len(names_list)):
114 |     test_bias[i] = hypergeom.sf(np.sum(spread_counts[i] != 0)-1, n_rows*n_cols-1, n_rows+n_cols - 2, num_wells_counts[i])
115 |
116 | mt = multipletests(test_bias[test_bias != 1])[0]
117 | bias_log = 'Found expression to be biased along a certain column and row combination {} times out of {}'.format(np.sum(mt), len(mt))
118 | print(bias_log)
119 | f, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(7,14))
120 | ax1.hist(test_bias)
121 | ax1.set_title('Bias Test')
122 | ax1.set_xlabel('p value counts bias')
123 | ax1.set_ylabel('Frequency (genes)')
124 |
125 | if np.sum(mt) != 0:
126 |
127 |     rate_spreading = np.median(prop_spread[test_bias != 1][mt].flatten()[prop_spread[test_bias != 1][mt].flatten() != 0])
128 |     rate_log = 'Estimated the median rate of spreading to be {}'.format(np.round(rate_spreading,5))
129 |     print(rate_log)
130 |
131 |     ax2.hist(np.log10(prop_spread[test_bias != 1][mt].flatten()[prop_spread[test_bias != 1][mt].flatten() != 0]))
132 |     ax2.set_title('Rate of Spreading')
133 |     ax2.text(0.7, 0.9,'Median: ' + str(np.round(rate_spreading,5)), ha='center', va='center', transform=ax2.transAxes)
134 |     ax2.set_ylabel('Frequency (Wells)')
135 |     ax2.set_xlabel(r'Rate of spreading ($log_{10}$)')
136 |
137 |     # Use linear regression to estimate the fraction of spread reads
138 |     spread_counts = spread_counts[test_bias != 1][mt]
139 |     true_counts = true_counts[test_bias != 1][mt]
140 |     true_median = np.median(true_counts)
141 |     model_df = pd.DataFrame([true_counts, np.apply_along_axis(np.sum, 1,spread_counts)], index=['true', 'spread']).T
142 |     y,X = dmatrices('spread ~ true', data = model_df, return_type='dataframe')
143 |     model = sm.OLS(y, X).fit()
144 |     ols_log = 'Estimated fraction of spread reads to be {} and variance explained R-squared = {}'.format(np.round(model.params['true'],5), np.round(model.rsquared,5))
145 |     print(ols_log)
146 |
147 |     ax3.scatter(np.log10(true_counts), np.log10(np.apply_along_axis(np.sum, 1,spread_counts)), s = 1)
148 |     ax3.set_xlabel(r'log$_{10}$(Source counts)')
149 |     ax3.set_ylabel(r'log$_{10}$(Spread counts)')
150 |     ax3.text(0.3, 0.9,r'coef = {}, $R^2 = $ {}'.format(np.round(model.params['true'],5), np.round(model.rsquared,5)), ha='center', va='center', transform=ax3.transAxes)
151 |     ax3.set_title('Genes highly expressed in only one cell')
152 | else:
153 |     print('Found no genes with bias along the column and row combination')
154 |     rate_log = 'Found no genes with bias along the column and row combination'
155 |     ols_log = 'Found no genes with bias along the column and row combination'
156 |
157 | print('Saving figure from analysis to {}'.format('{}_figures.pdf'.format(base)))
158 | plt.tight_layout()
159 | plt.savefig('{}_figures.pdf'.format(base))
160 |
161 | print('Saving log file from analysis to {}'.format('{}_unspread.log'.format(base)))
162 |
163 | if np.sum(mt) != 0:
164 |     with open('{}_unspread.log'.format(base), "w") as log_file:
165 |         print('# {}\n# {}\n# {}\n# {}\n{}\t{}\t{}\t{}\t{}\t{}\t{}'.format(filename, bias_log, rate_log, ols_log, filename, np.sum(mt), len(mt), np.round(rate_spreading,7), np.round(model.params['true'],7), np.round(model.rsquared,7), true_median), file=log_file)
166 | else:
167 |     with open('{}_unspread.log'.format(base), "w") as log_file:
168 |         print('# {}\n# {}\n# {}\n# {}\n{}\t{}\t{}\t{}\t{}\t{}\t{}'.format(filename, bias_log, rate_log, ols_log, filename, np.sum(mt), len(mt), 'NaN', 'NaN', 'NaN', 'NaN'), file=log_file)
169 |
170 | # You can set the threshold yourself
171 | if (np.sum(mt) == 0 or model.params['true'] < threshold) and r == 0.0:
172 |     print('The experiment shows no or only an acceptable amount of spreading, correction is not necessary. Exiting...')
173 |     quit()
174 | if r != 0.0:
175 |     rate_spreading = r
176 | # Setting the rate of spreading matrices
177 | column_spread = np.zeros((n_rows, n_rows))
178 | row_spread = np.zeros((n_cols,n_cols))
179 | column_spread[:,:] = rate_spreading
180 | row_spread[:,:] = rate_spreading
181 | np.fill_diagonal(column_spread, 0.5 - rate_spreading/2)
182 | np.fill_diagonal(row_spread, 0.5 - rate_spreading/2)
183 |
184 | # The function which does the correction, you can set the cutoff yourself
185 | def adjust_reads(mat, column_spread=column_spread, row_spread=row_spread, cutoff=c, n_r=n_rows, n_c=n_cols):
186 |     mat = np.array(mat).flatten()
187 |     mat = mat.reshape(n_r, n_c)
188 |     adjusted_reads = np.rint(solve_sylvester(column_spread, row_spread, mat))
189 |     # A lower-bound cutoff removes false positives (unfortunately it also removes true reads with low counts in that cell)
190 |     adjusted_reads[adjusted_reads < cutoff] = 0
191 |     return adjusted_reads
192 |
193 | print('Correcting spreading for each gene')
194 |
195 | adj_list = []
196 | for i, col in df_noindex.items():
197 |     adj_list.append(adjust_reads(col).flatten())
198 |
199 | # Put correction into a dataframe and save to a .csv file
200 | df_adj = pd.DataFrame(data=adj_list, index = df_noindex.columns.values, columns= df_noindex.index.values).T
201 |
202 | df_adj = pd.concat([df[i7_index_name], df[i5_index_name] , df_adj], axis = 1)
203 |
204 | print('Saving correction to {}'.format('{}_corrected.csv'.format(base)))
205 |
206 | df_adj.to_csv('{}_corrected.csv'.format(base))
207 |
--------------------------------------------------------------------------------