├── .github
    └── workflows
    │   └── pythonpublish.yml
├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README.md
├── VERSION
├── examples
    ├── DQ2_2024_P08758.png
    ├── custom_list_of_proteins.tsv
    └── samples.tsv
├── ms1searchpy
    ├── __init__.py
    ├── combine.py
    ├── combine_proteins.py
    ├── directms1quant.py
    ├── directms1quantmulti.py
    ├── group_specific.py
    ├── main.py
    ├── models
    │   └── CSD_model_LCMSMS.hdf5
    ├── ms1todiffacto.py
    ├── search.py
    ├── utils.py
    └── utils_figures.py
├── requirements.txt
├── setup.py
└── test.sh


/.github/workflows/pythonpublish.yml:
--------------------------------------------------------------------------------
 1 | name: Upload Python Package
 2 | 
 3 | on:
 4 |  release:
 5 |   types: [published]
 6 | 
 7 | jobs:
 8 |   deploy:
 9 |     runs-on: ubuntu-latest
10 |     steps:
11 |     - uses: actions/checkout@v2
12 |     - name: Set up Python
13 |       uses: actions/setup-python@v2
14 |       with:
15 |         python-version: '3.x'
16 |     - name: Install dependencies
17 |       run: |
18 |         python -m pip install --upgrade pip
19 |         pip install setuptools wheel twine
20 |     - name: Build and publish
21 |       env:
22 |         TWINE_USERNAME: '__token__'
23 |         TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
24 |       run: |
25 |         python setup.py sdist bdist_wheel
26 |         twine upload dist/*
27 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *.egg-info
2 | **/__pycache__
3 | test_data*
4 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | Copyright 2021 Mark Ivanov
 2 | 
 3 | Licensed under the Apache License, Version 2.0 (the "License");
 4 | you may not use this file except in compliance with the License.
 5 | You may obtain a copy of the License at
 6 | 
 7 |     http://www.apache.org/licenses/LICENSE-2.0
 8 | 
 9 | Unless required by applicable law or agreed to in writing, software
10 | distributed under the License is distributed on an "AS IS" BASIS,
11 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | See the License for the specific language governing permissions and
13 | limitations under the License.


--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include requirements.txt


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # ms1searchpy - a DirectMS1 proteomics search engine for LC-MS1 spectra
  2 | 
  3 | `ms1searchpy` consumes LC-MS data (**mzML**) or peptide features (**tsv**) and performs protein identification and quantitation. It is recommended to run the peptide cluster detection (biosaur2) separately so that the user can control the cluster search parameters. In addition, this will eliminate unnecessary calculations when reprocessing files.
  4 | 
  5 | ## Basic usage
  6 | 
  7 | Basic command for protein identification:
  8 | 
  9 |     biosaur2 *.mzML
 10 |     ms1searchpy *features.tsv -d path_to.FASTA
 11 |     
 12 | or
 13 | 
 14 |     ms1searchpy *.mzML -d path_to.FASTA
 15 | 
 16 | Read further for detailed info, including quantitative analysis.
 17 | 
 18 | ## Citing ms1searchpy
 19 | 
 20 | Ivanov et al. DirectMS1Quant: Ultrafast Quantitative Proteomics with MS/MS-Free Mass Spectrometry. https://pubs.acs.org/doi/10.1021/acs.analchem.2c02255
 21 | 
 22 | Ivanov et al. Boosting MS1-only Proteomics with Machine Learning Allows 2000 Protein Identifications in Single-Shot Human Proteome Analysis Using 5 min HPLC Gradient. https://doi.org/10.1021/acs.jproteome.0c00863
 23 | 
 24 | Ivanov et al. DirectMS1: MS/MS-free identification of 1000 proteins of cellular proteomes in 5 minutes. https://doi.org/10.1021/acs.analchem.9b05095
 25 | 
 26 | ## Installation
 27 | 
 28 | It is recommended to additionally install [DeepLC](https://github.com/compomics/DeepLC) version 1.1.2.2 (unofficial fork with small changes). Newer version has some issues right now. It is highly important to install DeepLC before ms1searchpy for outdated packages compatibility!
 29 | 
 30 |     pip install https://github.com/markmipt/DeepLC/archive/refs/heads/alternative_best_model.zip
 31 | 
 32 | After that, install ms1searchpy:
 33 | 
 34 |     pip install ms1searchpy
 35 | 
 36 | This should work on recent versions of Python (3.8-3.10).
 37 | 
 38 | ## Usage tutorial: protein identification
 39 | 
 40 | The script used for protein identification is called `ms1searchpy`. It needs input files (mzML or tsv) and a FASTA database.
 41 | 
 42 | ### Input files
 43 | 
 44 | If mzML are provided, ms1searchpy will invoke [biosaur2](https://github.com/markmipt/biosaur2) to generate the features table.
 45 | You can also use other software like [Dinosaur](https://github.com/fickludd/dinosaur) or [Biosaur](https://github.com/abdrakhimov1/Biosaur),
 46 | but [biosaur2](https://github.com/markmipt/biosaur2) is recommended. You can also make it yourself,
 47 | the table must contain columns 'massCalib', 'rtApex', 'charge' and 'nIsotopes' columns.
 48 | 
 49 | #### How to get mzML files
 50 | 
 51 | To get mzML from RAW files, you can use [Proteowizard MSConvert](https://proteowizard.sourceforge.io/download.html)...
 52 | 
 53 |     msconvert path_to_file.raw -o path_to_output_folder --mzML --filter "peakPicking true 1-" --filter "MS2Deisotope" --filter "zeroSamples removeExtra" --filter "threshold absolute 1 most-intense"
 54 | 
 55 | ...or [compomics ThermoRawFileParser](https://github.com/compomics/ThermoRawFileParser), which produces suitable files
 56 | with default parameters.
 57 | 
 58 | ### RT predictor
 59 | 
 60 | For protein identification, `ms1searchpy` needs a retention time prediction model. The recommended one is [DeepLC](https://github.com/compomics/DeepLC),
 61 | but you can also use built-in additive model (default).
 62 | 
 63 | ### Examples
 64 | 
 65 |     biosaur2 test.mzML -minlh 3
 66 |     ms1searchpy test.features.tsv -d sprot_human.fasta -deeplc 1 -ad 1
 67 | 
 68 | The first command will run `biosaur2` to detect all peptide isotopic clusters which are visible in at least 3 consecutive MS1 scans. The second command will run `ms1searchpy` with DeepLC RT predictor available as `deeplc` (should work if you install DeepLC
 69 | alongside `ms1searchpy`. `-ad 1` creates a shuffled decoy database for FDR estimation.
 70 | You should use it only once and just use the created database for other searches.
 71 | 
 72 | ### Output files
 73 | 
 74 | `ms1searchpy` produces several tables:
 75 |  - identified proteins, FDR-filtered (`sample.features_proteins.tsv`) - this is the main result;
 76 |  - all identified proteins (`sample.features_proteins_full.tsv`);
 77 |  - all identified proteins based on all PFMs (`sample.features_proteins_full_noexclusion.tsv`);
 78 |  - all matched peptide match fingerprints, or peptide-feature matches (`sample.features_PFMs.tsv`);
 79 |  - all PFMs with features prepared for Machnine Learning (`sample.features_PFMs_ML.tsv`);
 80 |  - number of theoretical peptides per protein (`sample.features_protsN.tsv`);
 81 |  - log file with estimated mass and RT accuracies (`sample.features_log.txt`).
 82 | 
 83 | ### Combine results from replicates
 84 | 
 85 | You can combine the results from several replicate runs with `ms1combine` by feeding it `_PFMs_ML.tsv` tables:
 86 | 
 87 |     ms1combine sample_rep_*.features_PFMs_ML.tsv
 88 | 
 89 | ### Using Group-specific FDR for metaproteomics
 90 | 
 91 | Group-specific FDR for metaproteomics should be used for accurate estimation of protein identified among the different groups. The command ms1groups should be used for that:
 92 | 
 93 |      ms1groups F04.features_PFMs_ML.tsv -d F04_top15_shuffled.fasta -out group_statistics_by -fdr 5.0 -groups genus
 94 | 
 95 | It produces a table with the number of identified proteins for each group using group-specific FDR. This is basically multiple DirectMS1 searches with small protein databases containing only a single group and combining results all together. However, using the mentioned “ms1groups” script and preliminary DirectMS1 search, two problems are solved: small statistics for all mass/RT/Machine Learning calibration procedures within DirectMS1 workflow for low-populated groups and computational time. Currently supported groups are 'species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom', 'domain'. The groups are automatically extracted using ete3 Python module and NCBI Taxonomic database. Also, the script supports groups dbname and OX: the the former is a taxonomy in swiss-prot protein name (_HUMAN, _YEAST, etc.) and the latter is the taxonomy provided by 'OX=' from protein description in the fasta file.
 96 | 
 97 | ## Usage tutorial: Quantitation
 98 | 
 99 | After obtaining the protein identification results, you can proceed to compare your samples using LFQ.
100 | 
101 | ### Using directms1quant
102 | 
103 | New LFQ method designed specifically for DirectMS1 is invoked like this:
104 | 
105 |     directms1quant -S1 sample1_r{1,2,3}.features_proteins_full.tsv -S2 sample2_r{1,2,3}.features_proteins_full.tsv
106 | 
107 | It produces a filtered table of significantly changed proteins with p-values and fold changes,
108 | as well as the full protein table and a separate file simply listing all
109 | IDs of significantly modified proteins (e.g. for easy copy-paste into a StringDB search window).
110 | 
111 | 
112 | ### Multi-condition protein profiling using directms1quantmulti
113 | 
114 | You can make a quantitation for complex projects using script directms1quantmulti. The example below is shown for our project of time-series profiling of glioblastoma cell line under interferon treatment.
115 | 
116 | Script takes a tab-separated table (.tsv) with details for all project files. An example of a sample file table is available here in the examples folder. It should contain the following columns:
117 | 
118 | File Name - filename of raw file. For example, “QEHFX_JB_000379”.
119 | 
120 | group - sample group of file. In our example, there are K (Control group), IFN30 (treatment with 30 units/ml of interferon) and IFN1000 groups. The first group mentioned in the table will be used as control for pairwise directms1quant runs.  
121 | 
122 | condition - sample subgroup of file. In our example, there are multiple time points after treatment, such as 0h, 30min, 1h, 2h, etc. By default, only the same conditions will be used for pairwise comparisons. For example, IFN30 0h vs K 0h; IFN1000 0h vs K 0h, etc.
123 | 
124 | vs - column for specific condition comparison. For example, in our case, we did not have control samples at the 30 min time point. Thus, we would like to proceed directms1quant runs for IFN30 30 min vs K 0h; and IFN1000 30 min vs K 0h comparisons. Thus, for the 30 min IFN30 and IFN1000 files we put “0h” in the “vs” column. See example table for details.
125 | 
126 | replicate - column for replicate number of specific condition and sample group.
127 | 
128 | BatchMS - column for mass-spectrometry Batch. This parameter is used for extra normalization within a batch.
129 | 
130 | 
131 | The script consists of four different stages and you can rerun the script without rerunning previous stages (“-start_stage” option).
132 | 
133 | Stage 1 is a set of pairwise DirectMS1Quant runs for different interferon treatment conditions versus control samples.
134 | 
135 | Stage 2 is preparation of peptide LFQ table for all files using the results obtained in the previous step.
136 | 
137 | Stage 3 is preparation of the protein LFQ table. Only the peptides labeled by DirectMS1Quant as significantly different between samples in at least X pairwise comparisons are used for protein quantitation. The X parameter is controlled by “min_signif_for_pept” option.
138 | 
139 | Stage 4 is preparation of LFQ profiling figures for proteins specified in the file under “proteins_for_figure” option. The file should be a tsv table with column “dbname” containing protein database names in the swiss-prot format. Any default directms1quant output table with differentially expressed proteins can be used here.
140 | 
141 | 
142 | Example of script usage::
143 | 
144 |     directms1quantmulti -db ~/fasta_folder/sprot_human_shuffled.fasta -pdir ~/folder_with_ms1searchpy_results/ -samples ~/samples.csv -min_signif_for_pept 2 -out DQmulti_2024 -pep_min_non_missing_samples 0.75 -start_stage 1 -proteins_for_figure ~/custom_list_of_proteins.tsv -figdir ~/output_figure_folder/
145 | 
146 | ## Links
147 | 
148 | - GitHub repo & issue tracker: https://github.com/markmipt/ms1searchpy
149 | - Mailing list: markmipt@gmail.com
150 | 
151 | - DeepLC repo: https://github.com/compomics/DeepLC
152 | 


--------------------------------------------------------------------------------
/VERSION:
--------------------------------------------------------------------------------
1 | 2.8.12
2 | 


--------------------------------------------------------------------------------
/examples/DQ2_2024_P08758.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/markmipt/ms1searchpy/ab7dd1aba1a513f71d028263972fd5e4c79e7090/examples/DQ2_2024_P08758.png


--------------------------------------------------------------------------------
/examples/custom_list_of_proteins.tsv:
--------------------------------------------------------------------------------
1 | dbname
2 | sp|P08758|ANXA5_HUMAN
3 | sp|P04083|ANXA1_HUMAN
4 | sp|P07355|ANXA2_HUMAN
5 | sp|P08133|ANXA6_HUMAN
6 | 


--------------------------------------------------------------------------------
/examples/samples.tsv:
--------------------------------------------------------------------------------
  1 | File Name	group	condition	replicate	BatchMS	vs
  2 | QEHFX_JB_000379	K	0h	1	1	
  3 | QEHFX_JB_000380	K	0h	2	1	
  4 | QEHFX_JB_000381	K	0h	3	1	
  5 | QEHFX_JB_000382	K	0h	4	1	
  6 | QEHFX_JB_000383	K	0h	5	1	
  7 | QEHFX_JB_000384	K	24h	1	1	
  8 | QEHFX_JB_000385	K	24h	2	1	
  9 | QEHFX_JB_000386	K	24h	3	1	
 10 | QEHFX_JB_000387	K	24h	4	1	
 11 | QEHFX_JB_000388	K	24h	5	1	
 12 | QEHFX_JB_000389	K	12h	1	1	
 13 | QEHFX_JB_000390	K	12h	2	1	
 14 | QEHFX_JB_000391	K	12h	3	1	
 15 | QEHFX_JB_000392	K	12h	4	1	
 16 | QEHFX_JB_000393	K	12h	5	1	
 17 | QEHFX_JB_000394	K	2h	1	1	
 18 | QEHFX_JB_000395	K	2h	2	1	
 19 | QEHFX_JB_000396	K	2h	3	1	
 20 | QEHFX_JB_000397	K	2h	4	1	
 21 | QEHFX_JB_000398	K	2h	5	1	
 22 | QEHFX_JB_000399	IFN30	24h	1	1	
 23 | QEHFX_JB_000400	IFN30	24h	2	1	
 24 | QEHFX_JB_000401	IFN30	24h	3	1	
 25 | QEHFX_JB_000402	IFN30	24h	4	1	
 26 | QEHFX_JB_000403	IFN30	24h	5	1	
 27 | QEHFX_JB_000404	IFN30	16h	1	1	12h
 28 | QEHFX_JB_000405	IFN30	16h	2	1	12h
 29 | QEHFX_JB_000406	IFN30	16h	3	1	12h
 30 | QEHFX_JB_000407	IFN30	16h	4	1	12h
 31 | QEHFX_JB_000408	IFN30	16h	5	1	12h
 32 | QEHFX_JB_000409	IFN30	20h	1	1	24h
 33 | QEHFX_JB_000410	IFN30	20h	2	1	24h
 34 | QEHFX_JB_000411	IFN30	20h	3	1	24h
 35 | QEHFX_JB_000412	IFN30	20h	4	1	24h
 36 | QEHFX_JB_000413	IFN30	20h	5	1	24h
 37 | QEHFX_JB_000414	IFN30	4h	1	1	2h
 38 | QEHFX_JB_000415	IFN30	4h	2	1	2h
 39 | QEHFX_JB_000416	IFN30	4h	3	1	2h
 40 | QEHFX_JB_000417	IFN30	4h	4	1	2h
 41 | QEHFX_JB_000418	IFN30	4h	5	1	2h
 42 | QEHFX_JB_000419	IFN30	7h	1	1	12h
 43 | QEHFX_JB_000420	IFN30	7h	2	1	12h
 44 | QEHFX_JB_000421	IFN30	7h	3	1	12h
 45 | QEHFX_JB_000422	IFN30	7h	4	1	12h
 46 | QEHFX_JB_000423	IFN30	7h	5	1	12h
 47 | QEHFX_JB_000424	IFN30	30min	1	1	0h
 48 | QEHFX_JB_000425	IFN30	30min	2	1	0h
 49 | QEHFX_JB_000426	IFN30	30min	3	1	0h
 50 | QEHFX_JB_000427	IFN30	30min	4	1	0h
 51 | QEHFX_JB_000428	IFN30	30min	5	1	0h
 52 | QEHFX_JB_000429	IFN30	1h	1	1	0h
 53 | QEHFX_JB_000430	IFN30	1h	2	1	0h
 54 | QEHFX_JB_000431	IFN30	1h	3	1	0h
 55 | QEHFX_JB_000432	IFN30	1h	4	1	0h
 56 | QEHFX_JB_000433	IFN30	1h	5	1	0h
 57 | QEHFX_JB_000434	IFN30	3h	1	1	2h
 58 | QEHFX_JB_000435	IFN30	3h	2	1	2h
 59 | QEHFX_JB_000436	IFN30	3h	3	1	2h
 60 | QEHFX_JB_000437	IFN30	3h	4	1	2h
 61 | QEHFX_JB_000442	IFN30	0h	1	1	2h
 62 | QEHFX_JB_000443	IFN30	0h	2	1	
 63 | QEHFX_JB_000444	IFN30	0h	3	1	
 64 | QEHFX_JB_000445	IFN30	0h	4	1	
 65 | QEHFX_JB_000446	IFN30	0h	5	1	
 66 | QEHFX_JB_000447	IFN30	2h	1	1	2h
 67 | QEHFX_JB_000448	IFN30	2h	2	1	2h
 68 | QEHFX_JB_000449	IFN30	2h	3	1	2h
 69 | QEHFX_JB_000450	IFN30	2h	4	1	2h
 70 | QEHFX_JB_000451	IFN30	2h	5	1	2h
 71 | QEHFX_JB_000452	IFN30	12h	1	1	12h
 72 | QEHFX_JB_000453	IFN30	12h	2	1	12h
 73 | QEHFX_JB_000454	IFN30	12h	3	1	12h
 74 | QEHFX_JB_000455	IFN30	12h	4	1	12h
 75 | QEHFX_JB_000456	IFN30	12h	5	1	12h
 76 | QEHFX_JB_000457	IFN1000	16h	1	1	12h
 77 | QEHFX_JB_000458	IFN1000	16h	2	1	12h
 78 | QEHFX_JB_000459	IFN1000	16h	3	1	12h
 79 | QEHFX_JB_000460	IFN1000	16h	4	1	12h
 80 | QEHFX_JB_000461	IFN1000	16h	5	1	12h
 81 | QEHFX_JB_000467	IFN1000	20h	1	1	24h
 82 | QEHFX_JB_000468	IFN1000	20h	2	1	24h
 83 | QEHFX_JB_000469	IFN1000	20h	3	1	24h
 84 | QEHFX_JB_000470	IFN1000	20h	4	1	24h
 85 | QEHFX_JB_000471	IFN1000	20h	5	1	24h
 86 | QEHFX_JB_000472	IFN1000	24h	1	1	24h
 87 | QEHFX_JB_000473	IFN1000	24h	2	1	24h
 88 | QEHFX_JB_000474	IFN1000	24h	3	1	24h
 89 | QEHFX_JB_000475	IFN1000	24h	4	1	24h
 90 | QEHFX_JB_000476	IFN1000	24h	5	1	24h
 91 | QEHFX_JB_000477	IFN1000	12h	1	1	12h
 92 | QEHFX_JB_000478	IFN1000	12h	2	1	12h
 93 | QEHFX_JB_000479	IFN1000	12h	3	1	12h
 94 | QEHFX_JB_000480	IFN1000	12h	4	1	12h
 95 | QEHFX_JB_000481	IFN1000	12h	5	1	12h
 96 | QEHFX_JB_000482	IFN1000	0h	1	1	0h
 97 | QEHFX_JB_000483	IFN1000	0h	2	1	0h
 98 | QEHFX_JB_000484	IFN1000	0h	3	1	0h
 99 | QEHFX_JB_000485	IFN1000	0h	4	1	0h
100 | QEHFX_JB_000486	IFN1000	0h	5	1	0h
101 | QEHFX_JB_000487	IFN1000	3h	1	1	2h
102 | QEHFX_JB_000488	IFN1000	3h	2	1	2h
103 | QEHFX_JB_000489	IFN1000	3h	3	1	2h
104 | QEHFX_JB_000490	IFN1000	3h	4	1	2h
105 | QEHFX_JB_000491	IFN1000	4h	1	1	2h
106 | QEHFX_JB_000492	IFN1000	4h	2	1	2h
107 | QEHFX_JB_000493	IFN1000	4h	3	1	2h
108 | QEHFX_JB_000494	IFN1000	4h	4	1	2h
109 | QEHFX_JB_000495	IFN1000	4h	5	1	2h
110 | QEHFX_JB_000496	IFN1000	2h	1	1	2h
111 | QEHFX_JB_000497	IFN1000	2h	2	1	2h
112 | QEHFX_JB_000498	IFN1000	2h	3	1	2h
113 | QEHFX_JB_000499	IFN1000	2h	4	1	2h
114 | QEHFX_JB_000500	IFN1000	2h	5	1	2h
115 | QEHFX_JB_000501	IFN1000	1h	1	1	0h
116 | QEHFX_JB_000502	IFN1000	1h	2	1	0h
117 | QEHFX_JB_000503	IFN1000	1h	3	1	0h
118 | QEHFX_JB_000504	IFN1000	1h	4	1	0h
119 | QEHFX_JB_000505	IFN1000	1h	5	1	0h
120 | QEHFX_JB_000506	IFN1000	7h	1	1	12h
121 | QEHFX_JB_000507	IFN1000	7h	2	1	12h
122 | QEHFX_JB_000508	IFN1000	7h	3	1	12h
123 | QEHFX_JB_000509	IFN1000	7h	4	1	12h
124 | QEHFX_JB_000510	IFN1000	7h	5	1	12h
125 | QEHFX_JB_000511	IFN1000	30min	1	1	0h
126 | QEHFX_JB_000512	IFN1000	30min	2	1	0h
127 | QEHFX_JB_000513	IFN1000	30min	3	1	0h
128 | QEHFX_JB_000514	IFN1000	30min	4	1	0h
129 | QEHFX_JB_000515	IFN1000	30min	5	1	0h
130 | 


--------------------------------------------------------------------------------
/ms1searchpy/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/markmipt/ms1searchpy/ab7dd1aba1a513f71d028263972fd5e4c79e7090/ms1searchpy/__init__.py


--------------------------------------------------------------------------------
/ms1searchpy/combine.py:
--------------------------------------------------------------------------------
  1 | from .main import final_iteration, filter_results
  2 | import pandas as pd
  3 | from collections import defaultdict
  4 | import argparse
  5 | import logging
  6 | 
  7 | logger = logging.getLogger(__name__)
  8 | 
  9 | def run():
 10 |     parser = argparse.ArgumentParser(
 11 |         description='Combine DirectMS1 search results',
 12 |         epilog='''
 13 | 
 14 |     Example usage
 15 |     -------------
 16 |     $ ms1combine.py file1_PFMs_ML.tsv ... filen_PFMs_ML.tsv
 17 |     -------------
 18 |     ''',
 19 |         formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 20 | 
 21 |     parser.add_argument('file', nargs='+', help='input tsv PFMs_ML files')
 22 |     parser.add_argument('-out', help='prefix output file names', default='combined')
 23 |     parser.add_argument('-prots_full', help='path to any of *_proteins_full.tsv file. By default this file will be searched in the folder with PFMs_ML files', default='')
 24 |     parser.add_argument('-fdr', help='protein fdr filter in %%', default=1.0, type=float)
 25 |     parser.add_argument('-prefix', help='decoy prefix', default='DECOY_')
 26 |     parser.add_argument('-nproc', help='number of processes', default=1, type=int)
 27 |     parser.add_argument('-pp', help='protein priority table for keeping protein groups when merge results by scoring', default='')
 28 |     args = vars(parser.parse_args())
 29 |     logging.basicConfig(format='%(levelname)9s: %(asctime)s %(message)s',
 30 |             datefmt='[%H:%M:%S]', level=logging.INFO)
 31 |     process_files(args)
 32 | 
 33 | 
 34 | def process_files(args):
 35 |     d_tmp = dict()
 36 | 
 37 |     df1 = None
 38 |     for idx, filen in enumerate(args['file']):
 39 |         logger.info('Reading file %s' % (filen, ))
 40 |         df3 = pd.read_csv(filen, sep='\t', usecols=['ids', 'qpreds', 'preds', 'decoy', 'seqs', 'proteins', 'peptide', 'iorig'])
 41 |         df3['ids'] = df3['ids'].apply(lambda x: '%d:%s' % (idx, str(x)))
 42 |         df3['fidx'] = idx
 43 | 
 44 |         df3 = df3[df3['qpreds'] <= 10]
 45 | 
 46 | 
 47 |         qval_ok = 0
 48 |         for qval_cur in range(10):
 49 |             if qval_cur != 10:
 50 |                 df1ut = df3[df3['qpreds'] == qval_cur]
 51 |                 decoy_ratio = df1ut['decoy'].sum() / len(df1ut)
 52 |                 d_tmp[(idx, qval_cur)] = decoy_ratio
 53 |                 # print(filen, qval_cur, decoy_ratio)
 54 | 
 55 |         if df1 is None:
 56 |             df1 = df3
 57 |             if args['prots_full']:
 58 |                 df2 = pd.read_csv(args['prots_full'], sep='\t')
 59 |             else:
 60 |                 try:
 61 |                     df2 = pd.read_csv(filen.replace('_PFMs_ML.tsv', '_proteins_full.tsv'), sep='\t')
 62 |                 except:
 63 |                     logging.critical('Proteins_full file is missing!')
 64 |                     break
 65 | 
 66 |         else:
 67 |             df1 = pd.concat([df1, df3], ignore_index=True)
 68 | 
 69 |     d_tmp = [z[0] for z in sorted(d_tmp.items(), key=lambda x: x[1])]
 70 |     qdict = {}
 71 |     for idx, val in enumerate(d_tmp):
 72 |         qdict[val] = int(idx / len(args['file']))
 73 |     df1['qpreds'] = df1.apply(lambda x: qdict[(x['fidx'], x['qpreds'])], axis=1)
 74 | 
 75 | 
 76 |     pept_prot = defaultdict(set)
 77 |     for seq, prots in df1[['seqs', 'proteins']].values:
 78 |         for dbname in prots.split(';'):
 79 |             pept_prot[seq].add(dbname)
 80 | 
 81 |     protsN = dict()
 82 |     for dbname, theorpept in df2[['dbname', 'theoretical peptides']].values:
 83 |         protsN[dbname] = theorpept
 84 | 
 85 | 
 86 |     prefix = args['prefix']
 87 |     isdecoy = lambda x: x[0].startswith(prefix)
 88 |     isdecoy_key = lambda x: x.startswith(prefix)
 89 |     escore = lambda x: -x[1]
 90 |     fdr = float(args['fdr']) / 100
 91 | 
 92 |     resdict = dict()
 93 | 
 94 |     resdict['qpreds'] = df1['qpreds'].values
 95 |     resdict['preds'] = df1['preds'].values
 96 |     resdict['seqs'] = df1['peptide'].values
 97 |     resdict['ids'] = df1['ids'].values
 98 |     resdict['iorig'] = df1['iorig'].values
 99 | 
100 |     # mass_diff = resdict['qpreds']
101 |     # rt_diff = resdict['qpreds']
102 | 
103 |     base_out_name = args['out']
104 | 
105 |     e_ind = resdict['qpreds'] <= 9
106 |     resdict = filter_results(resdict, e_ind)
107 | 
108 |     mass_diff = resdict['qpreds']
109 |     rt_diff = resdict['qpreds']
110 | 
111 |     if args['pp']:
112 |         df4 = pd.read_table(args['pp'])
113 |         prots_spc_basic2 = df4.set_index('dbname')['score'].to_dict()
114 |     else:
115 |         prots_spc_basic2 = False
116 | 
117 |     final_iteration(resdict, mass_diff, rt_diff, pept_prot, protsN, base_out_name, prefix, isdecoy, isdecoy_key, escore, fdr, args['nproc'], prots_spc_basic2=prots_spc_basic2)
118 | 
119 | 
120 | if __name__ == '__main__':
121 |     run()
122 | 


--------------------------------------------------------------------------------
/ms1searchpy/combine_proteins.py:
--------------------------------------------------------------------------------
 1 | from pyteomics import auxiliary as aux
 2 | import pandas as pd
 3 | import argparse
 4 | import logging
 5 | 
 6 | logger = logging.getLogger(__name__)
 7 | 
 8 | def run():
 9 |     parser = argparse.ArgumentParser(
10 |         description='Combine DirectMS1 search results',
11 |         epilog='''
12 | 
13 |     Example usage
14 |     -------------
15 |     $ ms1combine_proteins file1_proteins_full.tsv ... filen_proteins_full.tsv
16 |     -------------
17 |     ''',
18 |         formatter_class=argparse.ArgumentDefaultsHelpFormatter)
19 | 
20 |     parser.add_argument('file', nargs='+', help='input tsv proteins_full files')
21 |     parser.add_argument('-out', help='prefix for joint file name', default='combined')
22 |     parser.add_argument('-fdr', help='protein fdr filter in %%', default=1.0, type=float)
23 |     parser.add_argument('-prefix', help='decoy prefix', default='DECOY_')
24 |     args = vars(parser.parse_args())
25 |     logging.basicConfig(format='%(levelname)9s: %(asctime)s %(message)s',
26 |             datefmt='[%H:%M:%S]', level=logging.INFO)
27 | 
28 |     tmp_list = []
29 |     for idx, filen in enumerate(args['file']):
30 |         tmp_list.append(pd.read_csv(filen, sep='\t'))
31 |     
32 |     df1 = pd.concat(tmp_list).groupby(['dbname']).sum().reset_index()
33 |     out_name = '%s.features_proteins.tsv' % (args['out'], )
34 | 
35 | 
36 | 
37 | 
38 |     prots_spc = df1.set_index('dbname').score.to_dict()
39 |     fdr = args['fdr'] / 100
40 | 
41 |     prefix = args['prefix']
42 |     isdecoy = lambda x: x[0].startswith(prefix)
43 |     isdecoy_key = lambda x: x.startswith(prefix)
44 |     escore = lambda x: -x[1]
45 | 
46 | 
47 | 
48 |     checked = set()
49 |     for k, v in list(prots_spc.items()):
50 |         if k not in checked:
51 |             if isdecoy_key(k):
52 |                 if prots_spc.get(k.replace(prefix, ''), -1e6) > v:
53 |                     del prots_spc[k]
54 |                     checked.add(k.replace(prefix, ''))
55 |             else:
56 |                 if prots_spc.get(prefix + k, -1e6) > v:
57 |                     del prots_spc[k]
58 |                     checked.add(prefix + k)
59 | 
60 |     filtered_prots = aux.filter(prots_spc.items(), fdr=fdr, key=escore, is_decoy=isdecoy, remove_decoy=True, formula=1, full_output=True, correction=1)
61 |     if len(filtered_prots) < 1:
62 |         filtered_prots = aux.filter(prots_spc.items(), fdr=fdr, key=escore, is_decoy=isdecoy, remove_decoy=True, formula=1, full_output=True, correction=0)
63 |     identified_proteins = 0
64 | 
65 |     for x in filtered_prots:
66 |         identified_proteins += 1
67 | 
68 |     logger.info('TOP 5 identified proteins:')
69 |     logger.info('dbname\tscore')
70 |     for x in filtered_prots[:5]:
71 |         logger.info('\t'.join((str(x[0]), str(x[1]))))
72 |     logger.info('Final joint search: identified proteins = %d', len(filtered_prots))
73 | 
74 |     df1 = pd.DataFrame.from_records(filtered_prots, columns=['dbname', 'score'])
75 |     df1.to_csv(out_name, index=False, sep='\t')
76 | 
77 | if __name__ == '__main__':
78 |     run()
79 | 


--------------------------------------------------------------------------------
/ms1searchpy/directms1quant.py:
--------------------------------------------------------------------------------
  1 | from __future__ import division
  2 | import argparse
  3 | import pandas as pd
  4 | import numpy as np
  5 | from scipy.stats import binom, ttest_ind
  6 | from scipy.optimize import curve_fit
  7 | import logging
  8 | from pyteomics import fasta
  9 | from collections import Counter, defaultdict
 10 | import random
 11 | random.seed(42)
 12 | 
 13 | logger = logging.getLogger(__name__)
 14 | 
 15 | 
 16 | def get_df_final(args, replace_label, allowed_peptides, allowed_prots_all, pep_RT=False, RT_threshold=False):
 17 | 
 18 |     df_final = False
 19 |     for i in range(1, 3, 1):
 20 |         sample_num = 'S%d' % (i, )
 21 |         if args.get(sample_num, 0):
 22 |             for z in args[sample_num]:
 23 |                 label = sample_num + '_' + z.replace(replace_label, '')
 24 |                 df3 = pd.read_table(z.replace(replace_label, '_PFMs.tsv'), usecols=['sequence', 'proteins', 'charge', 'ion_mobility', 'Intensity', 'RT'])
 25 | 
 26 | 
 27 |                 if not args['allowed_peptides']:
 28 |                     df3['tmpseq'] = df3['sequence']
 29 |                     df3 = df3[df3['tmpseq'].apply(lambda x: x in allowed_peptides)]
 30 |                 else:
 31 |                     df3 = df3[df3['sequence'].apply(lambda x: x in allowed_peptides)]
 32 | 
 33 | 
 34 |                 df3 = df3[df3['proteins'].apply(lambda x: any(z in allowed_prots_all for z in x.split(';')))]
 35 |                 df3['proteins'] = df3['proteins'].apply(lambda x: ';'.join([z for z in x.split(';') if z in allowed_prots_all]))
 36 | 
 37 |                 df3['origseq'] = df3['sequence']
 38 |                 df3['sequence'] = df3['sequence'] + df3['charge'].astype(int).astype(str) + df3['ion_mobility'].astype(str)
 39 | 
 40 |                 if pep_RT is not False:
 41 |                     df3 = df3[df3.apply(lambda x: abs(pep_RT[x['sequence']] - x['RT']) <= 3 * RT_threshold, axis=1)]
 42 | 
 43 |                 df3 = df3.sort_values(by='Intensity', ascending=False)
 44 | 
 45 |                 df3 = df3.drop_duplicates(subset='sequence')
 46 | 
 47 |                 df3[label] = df3['Intensity']
 48 |                 df3['protein'] = df3['proteins']
 49 |                 df3['peptide'] = df3['sequence']
 50 |                 if pep_RT is False:
 51 |                     df3['RT_'+label] = df3['RT']
 52 |                     df3 = df3[['origseq', 'peptide', 'protein', label, 'RT_'+label]]
 53 |                 else:
 54 |                     df3 = df3[['origseq', 'peptide', 'protein', label]]
 55 | 
 56 | 
 57 |                 if df_final is False:
 58 |                     df_final = df3.reset_index(drop=True)
 59 |                 else:
 60 |                     df_final = df_final.reset_index(drop=True).merge(df3.reset_index(drop=True), on='peptide', how='outer')
 61 |                     df_final.protein_x = df_final.protein_x.fillna(value=df_final.protein_y)
 62 |                     df_final.origseq_x = df_final.origseq_x.fillna(value=df_final.origseq_y)
 63 |                     df_final['protein'] = df_final['protein_x']
 64 |                     df_final['origseq'] = df_final['origseq_x']
 65 | 
 66 |                     df_final = df_final.drop(columns=['protein_x', 'protein_y'])
 67 |                     df_final = df_final.drop(columns=['origseq_x', 'origseq_y'])
 68 | 
 69 |     return df_final
 70 | 
 71 | 
 72 | def calc_sf_all(v, n, p):
 73 |     sf_values = -np.log10(binom.sf(v-1, n, p))
 74 |     sf_values[v <= 2] = 0
 75 |     sf_values[np.isinf(sf_values)] = 20
 76 |     sf_values[n == 0] = 0
 77 |     return sf_values
 78 | 
 79 | def noisygaus(x, a, x0, sigma, b):
 80 |     return a * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2)) + b
 81 | 
 82 | def calibrate_mass(bwidth, mass_left, mass_right, true_md):
 83 | 
 84 |     bbins = np.arange(-mass_left, mass_right, bwidth)
 85 |     H1, b1 = np.histogram(true_md, bins=bbins)
 86 |     b1 = b1 + bwidth
 87 |     b1 = b1[:-1]
 88 | 
 89 | 
 90 |     popt, pcov = curve_fit(noisygaus, b1, H1, p0=[1, np.median(true_md), 1, 1])
 91 |     mass_shift, mass_sigma = popt[1], abs(popt[2])
 92 |     return mass_shift, mass_sigma, pcov[0][0]
 93 | 
 94 | def run():
 95 |     parser = argparse.ArgumentParser(
 96 |         description='run DirectMS1quant for ms1searchpy results',
 97 |         epilog='''
 98 | 
 99 |     Example usage
100 |     -------------
101 |     $ directms1quant -S1 sample1_1_proteins_full.tsv sample1_n_proteins_full.tsv -S2 sample2_1_proteins_full.tsv sample2_n_proteins_full.tsv
102 |     -------------
103 |     ''',
104 |         formatter_class=argparse.ArgumentDefaultsHelpFormatter)
105 | 
106 |     parser.add_argument('-S1', nargs='+', help='input files for S1 sample', required=True)
107 |     parser.add_argument('-S2', nargs='+', help='input files for S2 sample', required=True)
108 |     parser.add_argument('-out', help='name of DirectMS1quant output file', default='directms1quant_out')
109 |     parser.add_argument('-min_samples', help='minimum number of samples for peptide usage. 0 means 50%% of input files', default=0)
110 |     parser.add_argument('-fold_change', help='FC threshold standard deviations', default=2.0, type=float)
111 |     parser.add_argument('-fold_change_abs', help='Use absolute log2 scale FC threshold instead of standard deviations', action='store_true')
112 |     # parser.add_argument('-bp', help='Experimental. Better percentage', default=80, type=int)
113 |     parser.add_argument('-minl', help='Min peptide length for quantitation', default=7, type=int)
114 |     parser.add_argument('-qval', help='qvalue threshold', default=0.05, type=float)
115 |     parser.add_argument('-intensity_norm', help='Intensity normalization: 0-none, 1-median, 2-sum 1000 most intense peptides (default)', default=2, type=int)
116 |     parser.add_argument('-all_proteins', help='use all proteins instead of FDR controlled', action='store_true')
117 |     parser.add_argument('-all_pfms', help='use all PFMs instead of ML controlled', action='store_true')
118 |     parser.add_argument('-allowed_peptides', help='path to allowed peptides')
119 |     parser.add_argument('-allowed_proteins', help='path to allowed proteins')
120 |     parser.add_argument('-protein_shifts', help='Experimental. path to protein shifts')
121 |     parser.add_argument('-d', '-db', help='path to uniprot fasta file for gene annotation')
122 |     parser.add_argument('-prefix', help='Decoy prefix. Default DECOY_', default='DECOY_', type=str)
123 |     args = vars(parser.parse_args())
124 |     logging.basicConfig(format='%(levelname)9s: %(asctime)s %(message)s',
125 |             datefmt='[%H:%M:%S]', level=logging.INFO)
126 | 
127 |     process_files(args)
128 | 
129 | 
130 | def process_files(args):
131 |     replace_label = '_proteins_full.tsv'
132 |     decoy_prefix = args['prefix']
133 | 
134 |     fold_change = float(args['fold_change'])
135 | 
136 |     all_s_lbls = {}
137 | 
138 |     logger.info('Starting analysis...')
139 | 
140 |     allowed_prots = set()
141 |     allowed_prots_all = set()
142 |     allowed_peptides = set()
143 | 
144 |     # cnt0 = Counter()
145 | 
146 |     cnt_file = 0
147 | 
148 |     for i in range(1, 3, 1):
149 |         sample_num = 'S%d' % (i, )
150 |         if args[sample_num]:
151 | 
152 | 
153 |             all_s_lbls[sample_num] = []
154 | 
155 |             for z in args[sample_num]:
156 | 
157 |                 cnt_file += 1
158 |                 logger.debug('Processing file %d', cnt_file)
159 | 
160 |                 label = sample_num + '_' + z.replace(replace_label, '')
161 |                 all_s_lbls[sample_num].append(label)
162 | 
163 |                 if not args['allowed_proteins']:
164 |                     if not args['all_proteins']:
165 |                         df0 = pd.read_table(z.replace('_proteins_full.tsv', '_proteins.tsv'), usecols=['dbname', ])
166 |                         allowed_prots.update(df0['dbname'])
167 |                         allowed_prots.update([decoy_prefix + z for z in df0['dbname'].values])
168 |                     else:
169 |                         df0 = pd.read_table(z, usecols=['dbname', ])
170 |                         allowed_prots.update(df0['dbname'])
171 | 
172 |                 if not args['allowed_peptides']:
173 |                     df0 = pd.read_table(z.replace('_proteins_full.tsv', '_PFMs_ML.tsv'), usecols=['seqs', 'qpreds', 'plen'])
174 | 
175 | 
176 |                     if not args['all_pfms']:
177 |                         df0 = df0[df0['qpreds'] <= 10]
178 | 
179 |                     df0 = df0[df0['plen'] >= args['minl']]
180 |                     # df0['seqs'] = df0['seqs']
181 |                     allowed_peptides.update(df0['seqs'])
182 |                     # cnt0.update(df0['seqs'])
183 | 
184 |     if args['allowed_proteins']:
185 |         allowed_prots = set(z.strip() for z in open(args['allowed_proteins'], 'r').readlines())
186 |         allowed_prots.update([decoy_prefix + z for z in allowed_prots])
187 | 
188 |     if args['allowed_peptides']:
189 |         allowed_peptides = set(z.strip() for z in open(args['allowed_peptides'], 'r').readlines())
190 |     else:
191 |         allowed_peptides = allowed_peptides
192 | 
193 |     logger.info('Total number of TARGET protein GROUPS: %d', len(allowed_prots) / 2)
194 | 
195 |     for i in range(1, 3, 1):
196 |         sample_num = 'S%d' % (i, )
197 |         if args.get(sample_num, 0):
198 |             for z in args[sample_num]:
199 |                 df3 = pd.read_table(z.replace(replace_label, '_PFMs.tsv'), usecols=['sequence', 'proteins', ])
200 |                 df3 = df3[df3['sequence'].apply(lambda x: x in allowed_peptides)]
201 | 
202 |                 df3_tmp = df3[df3['proteins'].apply(lambda x: any(z in allowed_prots for z in x.split(';')))]
203 |                 for dbnames in set(df3_tmp['proteins'].values):
204 |                     for dbname in dbnames.split(';'):
205 |                         allowed_prots_all.add(dbname)
206 | 
207 |     df_final = get_df_final(args, replace_label, allowed_peptides, allowed_prots_all, pep_RT=False, RT_threshold=False)
208 | 
209 |     logger.info('Total number of peptide sequences used in quantitation: %d', len(set(df_final['origseq'])))
210 | 
211 |     cols = [z for z in df_final.columns.tolist() if not z.startswith('mz_') and not z.startswith('RT_')]
212 |     df_final = df_final[cols]
213 | 
214 |     df_final = df_final.set_index('peptide')
215 | 
216 | 
217 |     all_lbls = all_s_lbls['S1'] + all_s_lbls['S2']
218 | 
219 |     df_final_copy = df_final.copy()
220 | 
221 |     custom_min_samples = int(args['min_samples'])
222 |     if custom_min_samples == 0:
223 |         custom_min_samples = int(len(all_lbls)/2)
224 | 
225 |     df_final = df_final_copy.copy()
226 | 
227 |     max_missing = len(all_lbls) - custom_min_samples
228 | 
229 |     logger.info('Allowed max number of missing values: %d', max_missing)
230 | 
231 |     df_final['nummissing'] = df_final.isna().sum(axis=1)
232 |     df_final['nummissing_S1'] = df_final[all_s_lbls['S1']].isna().sum(axis=1)
233 |     df_final['nummissing_S2'] = df_final[all_s_lbls['S2']].isna().sum(axis=1)
234 |     df_final['nonmissing_S1'] = len(all_s_lbls['S1']) - df_final['nummissing_S1']
235 |     df_final['nonmissing_S2'] = len(all_s_lbls['S2']) - df_final['nummissing_S2']
236 |     df_final['nonmissing'] = df_final['nummissing'] <= max_missing
237 | 
238 |     df_final = df_final[df_final['nonmissing']]
239 |     logger.info('Total number of PFMs passed missing values threshold: %d', len(df_final))
240 | 
241 | 
242 |     df_final['S2_mean'] = df_final[all_s_lbls['S2']].mean(axis=1)
243 |     df_final['S1_mean'] = df_final[all_s_lbls['S1']].mean(axis=1)
244 |     df_final['FC_raw'] = np.log2(df_final['S2_mean']/df_final['S1_mean'])
245 | 
246 |     FC_max = df_final['FC_raw'].max()
247 |     FC_min = df_final['FC_raw'].min()
248 | 
249 |     df_final.loc[(pd.isna(df_final['S2_mean'])) & (~pd.isna(df_final['S1_mean'])), 'FC_raw'] = FC_min
250 |     df_final.loc[(~pd.isna(df_final['S2_mean'])) & (pd.isna(df_final['S1_mean'])), 'FC_raw'] = FC_max
251 | 
252 |     if args['intensity_norm'] == 2:
253 |         for cc in all_lbls:
254 |             df_final[cc] = df_final[cc] / df_final[cc].nlargest(1000).sum()
255 | 
256 |     elif args['intensity_norm'] == 1:
257 |         for cc in all_lbls:
258 |             df_final[cc] = df_final[cc] / df_final[cc].median()
259 | 
260 |     for slbl in ['1', '2']:
261 |         S_len_current = len(all_s_lbls['S%s' % (slbl, )])
262 |         df_final['S%s_mean' % (slbl, )] = df_final[all_s_lbls['S%s' % (slbl, )]].mean(axis=1)
263 |         df_final['S%s_std' % (slbl, )] = np.log2(df_final[all_s_lbls['S%s' % (slbl, )]]).std(axis=1)
264 | 
265 |     df_final['S1_std'] = df_final['S1_std'].fillna(df_final['S2_std'])
266 |     df_final['S2_std'] = df_final['S2_std'].fillna(df_final['S1_std'])
267 | 
268 |     idx_to_calc_initial_pval = df_final[['nonmissing_S1', 'nonmissing_S2']].min(axis=1) >= 2
269 |     df_final.loc[idx_to_calc_initial_pval, 'p-value'] = list(ttest_ind(np.log10(df_final.loc[idx_to_calc_initial_pval, all_s_lbls['S1']].values.astype(float)), np.log10(df_final.loc[idx_to_calc_initial_pval, all_s_lbls['S2']].values.astype(float)), axis=1, nan_policy='omit', equal_var=True)[1])
270 |     df_final['p-value'] = df_final['p-value'].astype(float)
271 | 
272 |     for cc in all_lbls:
273 |         df_final[cc] = df_final[cc].fillna(df_final[cc].min())
274 | 
275 |     idx_missing_pval = pd.isna(df_final['p-value'])
276 | 
277 |     df_final.loc[idx_missing_pval, 'p-value'] = list(ttest_ind(np.log10(df_final.loc[idx_missing_pval, all_s_lbls['S1']].values.astype(float)), np.log10(df_final.loc[idx_missing_pval, all_s_lbls['S2']].values.astype(float)), axis=1, nan_policy='omit', equal_var=True)[1])
278 | 
279 |     df_final['p-value'] = df_final['p-value'].fillna(1.0)
280 |     p_val_threshold = 0.1
281 | 
282 |     for cc in all_lbls:
283 |         df_final[cc] = df_final[cc].fillna(df_final[cc].min())
284 | 
285 |     df_final['intensity_median'] = df_final[['S1_mean', 'S2_mean']].max(axis=1)
286 |     df_final['iq'] = pd.qcut(df_final['intensity_median'], 5, labels=range(5)).fillna(0).astype(int)
287 | 
288 |     df_final['FC'] = np.log2(df_final['S2_mean']/df_final['S1_mean'])
289 | 
290 | 
291 |     FC_max = df_final['FC'].max()
292 |     FC_min = df_final['FC'].min()
293 | 
294 |     df_final_for_calib = df_final.copy()
295 |     df_final_for_calib = df_final_for_calib[~pd.isna(df_final_for_calib['S1_mean'])]
296 |     df_final_for_calib = df_final_for_calib[df_final_for_calib['FC'] <= FC_max/2]
297 |     df_final_for_calib = df_final_for_calib[df_final_for_calib['FC'] >= FC_min/2]
298 |     df_final_for_calib = df_final_for_calib[~pd.isna(df_final_for_calib['S2_mean'])]
299 | 
300 |     df_final.loc[(pd.isna(df_final['S2_mean'])) & (~pd.isna(df_final['S1_mean'])), 'FC'] = FC_min
301 |     df_final.loc[(~pd.isna(df_final['S2_mean'])) & (pd.isna(df_final['S1_mean'])), 'FC'] = FC_max
302 | 
303 |     tmp = df_final_for_calib['FC']
304 | 
305 |     try:
306 |         FC_mean, FC_std, covvalue_cor = calibrate_mass(0.05, -tmp.min(), tmp.max(), tmp)
307 |         FC_mean2, FC_std2, covvalue_cor2 = calibrate_mass(0.1, -tmp.min(), tmp.max(), tmp)
308 |         if not np.isinf(covvalue_cor2) and abs(FC_mean2) <= abs(FC_mean) / 10:
309 |             FC_mean = FC_mean2
310 |             FC_std = FC_std2
311 |     except:
312 |         FC_mean, FC_std, covvalue_cor = calibrate_mass(0.3, -tmp.min(), tmp.max(), tmp)
313 | 
314 |     if not args['fold_change_abs']:
315 |         fold_change = FC_std * fold_change
316 |     logger.info('Absolute FC threshold = %.2f +- %.2f', FC_mean, fold_change)
317 | 
318 |     df_final['decoy'] = df_final['protein'].apply(lambda x: all(z.startswith(decoy_prefix) for z in x.split(';')))
319 | 
320 |     df_final = df_final.assign(protein=df_final['protein'].str.split(';')).explode('protein').reset_index(drop=False)
321 |     df_final['proteins'] = df_final['protein']
322 |     df_final = df_final.drop(columns=['protein'])
323 | 
324 |     df_final = df_final.sort_values(by=['nummissing', 'intensity_median'], ascending=(True, False))
325 |     df_final = df_final.drop_duplicates(subset=('origseq', 'proteins'))
326 | 
327 | 
328 |     df_final['FC_corrected'] = df_final['FC'] - FC_mean
329 | 
330 |     if args['protein_shifts']:
331 |         df_shifts = pd.read_table(args['protein_shifts'])
332 |         shifts_map = df_shifts.set_index('dbname')['FC shift'].to_dict()
333 |         df_final['FC_corrected'] = df_final.apply(lambda x: x['FC_corrected'] - shifts_map.get(x['proteins'], 0), axis=1)
334 | 
335 | 
336 | 
337 | 
338 |     df_final['FC_abs'] = df_final['FC_corrected'].abs()
339 |     df_final = df_final.sort_values(by='FC_abs').reset_index(drop=True)
340 |     df_final['FC_abs'] = df_final['FC_corrected']
341 | 
342 |     idx_stat = df_final['p-value'] <= p_val_threshold
343 |     df_final['FC_gr_mean'] = df_final.groupby('proteins', group_keys=False)['FC_abs'].apply(lambda x: x.expanding().mean())
344 | 
345 |     neg_idx = (df_final['FC_corrected'] < 0)
346 |     pos_idx = (df_final['FC_corrected'] >= 0)
347 | 
348 |     pos_idx_real = df_final[pos_idx].index
349 |     neg_idx_real = df_final[neg_idx].index
350 | 
351 |     pos_idx_real_set = set(pos_idx_real)
352 |     neg_idx_real_set = set(neg_idx_real)
353 | 
354 |     df_final_decoy = df_final[df_final['decoy']]
355 | 
356 |     FC_pools = dict()
357 |     FC_pools['common'] = dict()
358 | 
359 |     df1_decoy_grouped_common = df_final[df_final['decoy']].groupby('iq')
360 |     for group_name, df_group in df1_decoy_grouped_common:
361 |         FC_pools['common'][group_name] = list(df_group[['FC_abs', 'p-value']].values)
362 | 
363 |     df_final['sign'] = False
364 | 
365 |     df1_grouped = df_final.groupby('proteins')
366 |   
367 | 
368 |     for group_name, df_group in df1_grouped:
369 | 
370 |         prot_idx = df_group.index
371 | 
372 |         idx = sorted(list(prot_idx))
373 |         idx_len = len(idx)
374 | 
375 |         loc_pos_pvalues = df_final.loc[idx, 'p-value'].values
376 | 
377 |         loc_pos_values = df_final.loc[idx, 'FC_gr_mean'].values
378 |         loc_pos_values = np.abs(loc_pos_values)
379 | 
380 |         pos_missing_list = list(df_final.loc[idx, 'iq'].values)
381 |         better_res = np.array([0] * idx_len)
382 |         for _ in range(100):
383 |             random_list_tmp = [random.choice(FC_pools['common'][nm]) for nm in pos_missing_list]
384 |             random_list = [z[0] for z in random_list_tmp]
385 |             random_list_pvalues = np.array([z[1] for z in random_list_tmp])
386 | 
387 |             list_to_compare_current = np.cumsum(random_list) / np.arange(1, idx_len+1, 1)
388 |             list_to_compare_current = np.abs(list_to_compare_current)
389 | 
390 |             better_res += (loc_pos_values >= list_to_compare_current) * (loc_pos_pvalues <= random_list_pvalues)
391 |         df_final.loc[idx, 'bp'] = better_res
392 | 
393 |         # Equivalent of 5% probability to be Random
394 |         df_final.loc[idx, 'sign'] = df_final.loc[idx, 'bp'] >= 58
395 | 
396 |     df_final['up'] = df_final['sign'] * (df_final['FC_corrected'] > 0)
397 |     df_final['down'] = df_final['sign'] * (df_final['FC_corrected'] < 0)
398 | 
399 |     cols = [z for z in df_final.columns.tolist() if not z.startswith('mz_') and not z.startswith('RT_')]
400 |     cols.remove('proteins')
401 |     cols.insert(0, 'proteins')
402 |     df_final = df_final[cols]
403 | 
404 |     df_final.to_csv(path_or_buf=args['out']+'_quant_peptides.tsv', sep='\t', index=False)
405 | 
406 |     df_final = df_final.sort_values(by=['nummissing', 'intensity_median'], ascending=(True, False))
407 |     df_final = df_final.drop_duplicates(subset=('origseq', 'proteins'))
408 | 
409 |     prot_to_peps = defaultdict(str)
410 |     for dbname, pepseq in df_final.sort_values(by='origseq')[['proteins', 'origseq']].values:
411 |         prot_to_peps[dbname] += pepseq
412 | 
413 |     all_peps_cnt = Counter(list(prot_to_peps.values()))
414 |     peps_more_than_2 = set([k for k, v in all_peps_cnt.items() if v >= 2])
415 | 
416 |     pep_groups = {}
417 |     protein_groups = {}
418 |     cur_group = 1
419 |     for dbname, pepseq in prot_to_peps.items():
420 |         if pepseq not in peps_more_than_2:
421 |             protein_groups[dbname] = cur_group
422 |             cur_group += 1
423 |         else:
424 |             if pepseq not in pep_groups:
425 |                 pep_groups[pepseq] = cur_group
426 |                 protein_groups[dbname] = cur_group
427 |                 cur_group += 1
428 |             else:
429 |                 protein_groups[dbname] = pep_groups[pepseq]
430 | 
431 |     del pep_groups
432 |     del prot_to_peps
433 |     del peps_more_than_2
434 |     del all_peps_cnt
435 | 
436 |     up_dict = df_final.groupby('proteins')['up'].sum().to_dict()
437 |     down_dict = df_final.groupby('proteins')['down'].sum().to_dict()
438 | 
439 |     ####### !!!!!!! #######
440 |     df_final['up'] = df_final.apply(lambda x: x['up'] if up_dict.get(x['proteins'], 0) >= down_dict.get(x['proteins'], 0) else x['down'], axis=1)
441 |     protsN = df_final.groupby('proteins')['up'].count().to_dict()
442 | 
443 |     prots_up = df_final.groupby('proteins')['up'].sum()
444 |     decoy_df = df_final[df_final['decoy']].drop_duplicates(subset='origseq')
445 | 
446 |     N_decoy_total = len(decoy_df)
447 |     upreg_decoy_total = decoy_df['up'].sum()
448 | 
449 |     N_nondecoy_total = (~df_final['decoy']).sum()
450 | 
451 |     p_up = upreg_decoy_total / N_decoy_total
452 | 
453 |     names_arr = np.array(list(protsN.keys()))
454 | 
455 |     logger.info('Total number of proteins used in quantitation: %d', sum(not z.startswith(decoy_prefix) for z in names_arr))
456 |     logger.info('Total number of peptides: %d', len(df_final))
457 |     logger.info('Total number of decoy peptides: %d', N_decoy_total)
458 |     logger.info('Probability of random peptide to be differentially expressed: %.3f', p_up)
459 | 
460 |     v_arr = np.array(list(prots_up.get(k, 0) for k in names_arr))
461 |     n_arr = np.array(list(protsN.get(k, 0) for k in names_arr))
462 | 
463 |     all_pvals = calc_sf_all(v_arr, n_arr, p_up)
464 | 
465 |     total_set = set()
466 |     total_set_genes = set()
467 | 
468 |     FC_up_dict_basic = df_final.groupby('proteins')['FC_corrected'].median().to_dict()
469 |     FC_up_dict_raw_basic = df_final.groupby('proteins')['FC_raw'].median().to_dict()
470 | 
471 |     df_final_up_idx = (df_final['up']>0)
472 | 
473 |     df_final.loc[df_final_up_idx, 'bestmissing'] = df_final.loc[df_final_up_idx, :].groupby('proteins')['nummissing'].transform('min')
474 | 
475 |     FC_up_dict2 = df_final.loc[df_final_up_idx, :].groupby('proteins')['FC_corrected'].median().to_dict()
476 |     FC_up_dict_raw2 = df_final.loc[df_final_up_idx, :].groupby('proteins')['FC_raw'].median().to_dict()
477 | 
478 |     df_out = pd.DataFrame()
479 |     df_out['score'] = all_pvals
480 |     df_out['dbname'] = names_arr
481 | 
482 |     df_out['log2FoldChange(S2/S1)'] = df_out['dbname'].apply(lambda x: FC_up_dict2.get(x))
483 |     df_out['log2FoldChange(S2/S1) no normalization'] = df_out['dbname'].apply(lambda x: FC_up_dict_raw2.get(x))
484 | 
485 |     df_out.loc[pd.isna(df_out['log2FoldChange(S2/S1)']), 'log2FoldChange(S2/S1)'] = df_out.loc[pd.isna(df_out['log2FoldChange(S2/S1)']), 'dbname'].apply(lambda x: FC_up_dict_basic.get(x))
486 |     df_out.loc[pd.isna(df_out['log2FoldChange(S2/S1) no normalization']), 'log2FoldChange(S2/S1) no normalization'] = df_out.loc[pd.isna(df_out['log2FoldChange(S2/S1) no normalization']), 'dbname'].apply(lambda x: FC_up_dict_raw_basic.get(x))
487 | 
488 | 
489 |     df_out.loc[:, 'log2FoldChange(S2/S1) using all peptides'] = df_out.loc[:, 'dbname'].apply(lambda x: FC_up_dict_basic.get(x))
490 |     df_out.loc[:, 'log2FoldChange(S2/S1) using all peptides and no normalization'] = df_out.loc[:, 'dbname'].apply(lambda x: FC_up_dict_raw_basic.get(x))
491 | 
492 |     df_out['differentially expressed peptides'] = v_arr
493 |     df_out['identified peptides'] = n_arr
494 | 
495 |     df_out['decoy'] = df_out['dbname'].str.startswith(decoy_prefix)
496 | 
497 |     df_out = df_out[~df_out['decoy']]
498 | 
499 |     df_out['protname'] = df_out['dbname'].apply(lambda x: x.split('|')[1] if '|' in x else x)
500 |     df_out['protein_quant_group'] = df_out['dbname'].apply(lambda x: protein_groups[x])
501 | 
502 |     genes_map = {}
503 |     if args['d']:
504 |         for prot, protseq in fasta.read(args['d']):
505 |             if decoy_prefix not in prot:
506 |                 try:
507 |                     prot_name = prot.split('|')[1]
508 |                 except:
509 |                     prot_name = prot.split(' ')[0]
510 |                 try:
511 |                     gene_name = prot.split('GN=')[1].split(' ')[0]
512 |                 except:
513 |                     gene_name = prot.split(' ')[0]
514 |                 genes_map[prot_name] = gene_name
515 | 
516 |         df_out['gene'] = df_out['protname'].apply(lambda x: genes_map[x])
517 | 
518 |     else:
519 |         df_out['gene'] = df_out['protname']
520 | 
521 | 
522 |     qval_threshold = args['qval']
523 | 
524 |     min_matched_peptides = 3
525 | 
526 |     df_out = df_out.sort_values(by='score', ascending=False).reset_index(drop=True)
527 | 
528 |     df_out['FC_pass'] = False
529 |     df_out['FC_pass'] = df_out['log2FoldChange(S2/S1)'].abs() >= fold_change
530 | 
531 | 
532 |     BH_idx = (df_out['identified peptides'] >= min_matched_peptides) & (df_out['FC_pass'])
533 | 
534 |     BH_idx_pos = (df_out['log2FoldChange(S2/S1)'] >= 0) & (df_out['identified peptides'] >= min_matched_peptides)
535 |     BH_idx_neg = (df_out['log2FoldChange(S2/S1)'] < 0) & (df_out['identified peptides'] >= min_matched_peptides)
536 | 
537 |     # df_out['p-value'] = 1.0
538 |     df_out['p-value'] = 10**(-df_out['score'])
539 |     df_out['BH_pass'] = False
540 | 
541 |     df_out_BH_multiplier = len(set(df_out[BH_idx]['protein_quant_group']))
542 |     lbl_to_use = 'protein_quant_group'
543 | 
544 |     current_rank = 0
545 |     BH_threshold_array = []
546 |     added_groups = set()
547 |     for z in df_out[BH_idx][lbl_to_use].values:
548 |         if z not in added_groups:
549 |             added_groups.add(z)
550 |             current_rank += 1
551 |         BH_threshold_array.append(-np.log10(current_rank * qval_threshold / df_out_BH_multiplier))
552 |     df_out.loc[BH_idx, 'BH_threshold'] = BH_threshold_array
553 | 
554 |     df_out.loc[BH_idx, 'BH_pass'] = df_out.loc[BH_idx, 'score'] >= df_out.loc[BH_idx, 'BH_threshold']
555 |     df_out.loc[BH_idx, 'FDR_pass'] = df_out.loc[BH_idx, 'score'] >= -np.log10(args['qval'])
556 |     # df_out.loc[BH_idx, 'BH_pass'] = df_out.loc[BH_idx, 'score'] >= -np.log10(args['qval'])
557 | 
558 |     # score_threshold = df_out.loc[(df_out['BH_pass']) & (BH_idx)]['score'].min()
559 |     # df_out.loc[BH_idx, 'BH_pass'] = df_out.loc[BH_idx, 'score'] >= score_threshold
560 | 
561 |     df_out = df_out.drop(columns = {'decoy'})
562 | 
563 |     df_out = df_out[['score', 'p-value', 'dbname', 'log2FoldChange(S2/S1)', 'differentially expressed peptides',
564 |                     'identified peptides', 'log2FoldChange(S2/S1) no normalization', 'log2FoldChange(S2/S1) using all peptides',
565 |                     'log2FoldChange(S2/S1) using all peptides and no normalization', 'protname', 'protein_quant_group', 'gene', 'FC_pass', 'FDR_pass', 'BH_pass',]]# 'BH_threshold']]
566 | 
567 |     df_out.to_csv(path_or_buf=args['out']+'_quant_full.tsv', sep='\t', index=False)
568 | 
569 |     # df_out_f = df_out[(df_out['FDR_pass']) & (df_out['FC_pass'])]
570 |     df_out_f = df_out[(df_out['BH_pass']) & (df_out['FC_pass'])]
571 | 
572 |     df_out_f.to_csv(path_or_buf=args['out']+'.tsv', sep='\t', index=False)
573 | 
574 |     for z in set(df_out_f['dbname']):
575 |         try:
576 |             prot_name = z.split('|')[1]
577 |         except:
578 |             prot_name = z
579 | 
580 |         gene_name = genes_map.get(prot_name, prot_name)
581 | 
582 |         total_set.add(prot_name)
583 |         total_set_genes.add(gene_name)
584 | 
585 |     logger.info('Total number of significantly changed proteins: %d', len(total_set))
586 |     logger.info('Total number of significantly changed genes: %d', len(total_set_genes))
587 | 
588 |     f1 = open(args['out'] + '_proteins_for_stringdb.txt', 'w')
589 |     for z in total_set:
590 |         f1.write(z + '\n')
591 |     f1.close()
592 | 
593 |     f1 = open(args['out'] + '_genes_for_stringdb.txt', 'w')
594 |     for z in total_set_genes:
595 |         f1.write(z + '\n')
596 |     f1.close()
597 | 
598 | if __name__ == '__main__':
599 |     run()
600 | 


--------------------------------------------------------------------------------
/ms1searchpy/directms1quantmulti.py:
--------------------------------------------------------------------------------
  1 | from __future__ import division
  2 | import argparse
  3 | import pandas as pd
  4 | import numpy as np
  5 | from scipy.stats import binom, ttest_ind
  6 | from scipy.optimize import curve_fit
  7 | import logging
  8 | from pyteomics import fasta
  9 | from collections import Counter, defaultdict
 10 | import random
 11 | random.seed(42)
 12 | 
 13 | from . import directms1quant
 14 | from os import path, listdir, makedirs
 15 | from copy import copy
 16 | import matplotlib.pyplot as plt
 17 | import seaborn as sb
 18 | 
 19 | logger = logging.getLogger(__name__)
 20 | 
 21 | def run():
 22 |     parser = argparse.ArgumentParser(
 23 |         description='Prepare LFQ protein table for project with multiple conditions (time-series,\
 24 |         concentration, thermal profiling, etc.',
 25 |         epilog='''
 26 | 
 27 |     Example usage
 28 |     -------------
 29 |     $ directms1quantmulti -S1 sample1_1_proteins_full.tsv sample1_n_proteins_full.tsv -S2 sample2_1_proteins_full.tsv sample2_n_proteins_full.tsv
 30 |     -------------
 31 |     ''',
 32 |         formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 33 | 
 34 |     parser.add_argument('-pdir', help='path to project folder with search results', required=True)
 35 |     parser.add_argument('-d', '-db', help='path to uniprot fasta file used for ms1searchpy', required=True)
 36 |     parser.add_argument('-samples', help='tsv table with sample details', required=True)
 37 |     parser.add_argument('-out', help='name of DirectMS1quant output files', default='DQmulti')
 38 |     parser.add_argument('-norm', help='LFQ normalization: (1) using median of CONTROL group, default; (2) using median of all groups', default=1, type=int)
 39 |     parser.add_argument('-proteins_for_figure', help='path to proteins for figure plotting', default='', type=str)
 40 |     parser.add_argument('-figdir', help='path to output folder for figures', default='')
 41 |     parser.add_argument('-min_samples', help='minimum number of non-missing peptide intensities for DirectMS1Quant pairwise comparisons. 0 means 50%% of input files', default=1)
 42 |     parser.add_argument('-pep_min_non_missing_samples', help='minimum fraction of files with non missing values for peptide', default=0.5, type=float)
 43 |     # parser.add_argument('-min_signif_for_pept', help='minimum number of pairwise DE results where peptide should be significant', default=999, type=int)
 44 |     parser.add_argument('-prefix', help='Decoy prefix. Default DECOY_', default='DECOY_', type=str)
 45 |     parser.add_argument('-start_stage', help='Can be 1, 2 or 3 to skip any stage which were already done', default=1, type=int)
 46 |     args = vars(parser.parse_args())
 47 |     logging.basicConfig(format='%(levelname)9s: %(asctime)s %(message)s',
 48 |             datefmt='[%H:%M:%S]', level=logging.INFO)
 49 | 
 50 |     process_files(args)
 51 | 
 52 | 
 53 | def process_files(args):
 54 | 
 55 |     infolder = args['pdir']
 56 |     ms1folder = args['pdir']
 57 |     path_to_fasta = args['d']
 58 | 
 59 |     df1 = pd.read_table(args['samples'], dtype={'group': str, 'condition': str, 'vs': str})
 60 |     df1['sample'] = df1['group']
 61 |     if 'condition' not in df1.columns:
 62 |         df1['condition'] = ''
 63 |         
 64 |     if 'replicate' not in df1.columns:
 65 |         df1['replicate'] = 1
 66 |         
 67 |     if 'BatchMS' not in df1.columns:
 68 |         df1['BatchMS'] = 1
 69 | 
 70 |     if 'vs' not in df1.columns:
 71 |         df1['vs'] = df1['condition']
 72 |     df1['vs'] = df1['vs'].fillna(df1['condition'])
 73 | 
 74 |     df1_filenames = set(df1['File Name'])
 75 | 
 76 |     f_dict_map = {}
 77 | 
 78 |     for fname, sname, ttime in df1[['File Name', 'sample', 'condition']].values:
 79 |         f_dict_map[fname] = (sname, ttime)
 80 |         
 81 |         
 82 |     s_files_dict = defaultdict(list)
 83 | 
 84 |     for fn in listdir(ms1folder):
 85 |         if fn.endswith('_proteins_full.tsv'):
 86 |             f_name = fn.replace('.features_proteins_full.tsv', '')
 87 |             if f_name in f_dict_map:
 88 |                 s_files_dict[f_dict_map[f_name]].append(path.join(ms1folder, fn))
 89 | 
 90 |     control_label = df1['group'].values[0]
 91 | 
 92 | 
 93 |     all_conditions = df1[df1['group']!=control_label].set_index(['group', 'condition'])['vs'].to_dict()
 94 | 
 95 |     outlabel = args['out']
 96 | 
 97 | 
 98 |     dquant_params_base = {
 99 |         'min_samples': args['min_samples'],
100 |         'fold_change': 2.0,
101 |         'bp': 80,
102 |         'minl': 7,
103 |         'qval': 0.05,
104 |         'intensity_norm': 2,
105 |         'allowed_peptides': '',
106 |         'protein_shifts': '',
107 |         'allowed_proteins': '',
108 |         'all_proteins': '',
109 |         'all_pfms': '',
110 |         'fold_change_abs': '',
111 |         'prefix': args['prefix'],
112 |         'd': args['d'],
113 |         'fold_change_no_correction': '',
114 |     }
115 | 
116 | 
117 |     if args['start_stage'] <= 1:
118 | 
119 |         logger.info('Starting Stage 1: Run pairwise DirectMS1Quant runs...')
120 | 
121 |         for i2, i1_val in all_conditions.items():
122 |             out_name = path.join(ms1folder, '%s_directms1quant_out_%s_vs_%s%s' % (outlabel, ''.join(list(i2)), control_label, i1_val))
123 |             dquant_params = copy(dquant_params_base)
124 |             dquant_params['S1'] = s_files_dict[(control_label, i1_val)]
125 |             dquant_params['S2'] = s_files_dict[i2]
126 | 
127 |             dquant_params['out'] = out_name
128 | 
129 |             directms1quant.process_files(dquant_params)
130 | 
131 |     else:
132 |         logger.info('Skipping Stage 1: Run pairwise DirectMS1Quant runs...')
133 | 
134 | 
135 |     pep_cnt = Counter()
136 |     pep_cnt_up = Counter()
137 |     for i2, i1_val in all_conditions.items():
138 |         out_name = path.join(ms1folder, '%s_directms1quant_out_%s_vs_%s%s.tsv' % (outlabel, ''.join(list(i2)), control_label, i1_val))
139 |         # if os.path.isfile(out_name):
140 |         df0_full = pd.read_table(out_name.replace('.tsv', '_quant_peptides.tsv'), usecols=['origseq', 'up', 'down', 'proteins'])
141 |             
142 |         
143 | 
144 | 
145 |         up_dict = df0_full.groupby('proteins')['up'].sum().to_dict()
146 |         down_dict = df0_full.groupby('proteins')['down'].sum().to_dict()
147 | 
148 |         ####### !!!!!!! #######
149 |         df0_full['up'] = df0_full.apply(lambda x: x['up'] if up_dict.get(x['proteins'], 0) >= down_dict.get(x['proteins'], 0) else x['down'], axis=1)
150 | 
151 | 
152 |         df0_full = df0_full.sort_values(by='up', ascending=False)
153 |         df0_full = df0_full.drop_duplicates(subset=['origseq', 'proteins'])
154 |         for pep, up_v, prot_for_pep in df0_full[['origseq', 'up', 'proteins']].values:
155 |             if up_v:
156 |                 pep_cnt_up[(pep, prot_for_pep)] += 1
157 |             pep_cnt[(pep, prot_for_pep)] += 1
158 |             
159 | 
160 |     allowed_peptides_base = set(k for k, v in pep_cnt.items())
161 |     logger.info('Total number of quantified peptides: %d', len(allowed_peptides_base))
162 |     allowed_peptides_base_only_sequences = set(k[0] for k in allowed_peptides_base)
163 | 
164 |     # allowed_peptides_up = set(k for k, v in pep_cnt_up.items() if v >= args['min_signif_for_pept'])
165 |     # logger.info('Total number of significant quantified peptides: %d', len(allowed_peptides_up))
166 | 
167 |     replace_label = '_proteins_full.tsv'
168 |     decoy_prefix = args['prefix']
169 | 
170 |     names_map = {}
171 | 
172 |     args_local = {'S1': []}
173 |     args_local['allowed_peptides'] = ''
174 |     args_local['allowed_proteins'] = ''
175 |     for z in listdir(infolder):
176 |         if z.endswith(replace_label):
177 |             zname = z.split('.')[0]
178 |             if zname in df1_filenames:
179 |                 args_local['S1'].append(path.join(infolder, z))
180 |                 names_map[args_local['S1'][-1]] = zname
181 | 
182 |     all_s_lbls = {}
183 | 
184 |     allowed_prots = set()
185 |     allowed_prots_all = set()
186 |     allowed_peptides = set()
187 | 
188 |     cnt0 = Counter()
189 | 
190 |     if args['start_stage'] > 2:
191 | 
192 |         sample_num = 'S1'
193 | 
194 |         all_s_lbls[sample_num] = []
195 | 
196 |         for z in args_local[sample_num]:
197 |             label = sample_num + '_' + z.replace(replace_label, '')
198 |             all_s_lbls[sample_num].append(label)
199 | 
200 |         all_lbls = all_s_lbls['S1']
201 |         out_name = path.join(ms1folder, '%s_peptide_LFQ.tsv' % (outlabel, ))
202 |         df_final = pd.read_table(out_name)
203 | 
204 |         logger.info('Skipping Stage 2: Prepare full LFQ peptide table...')
205 | 
206 |     else:
207 |         logger.info('Starting Stage 2: Prepare full LFQ peptide table...')
208 |                 
209 |         sample_num = 'S1'
210 | 
211 |         all_s_lbls[sample_num] = []
212 | 
213 |         for z in args_local[sample_num]:
214 |             label = sample_num + '_' + z.replace(replace_label, '')
215 |             all_s_lbls[sample_num].append(label)
216 | 
217 |             df0 = pd.read_table(z.replace('_proteins_full.tsv', '_proteins.tsv'), usecols=['dbname', ])
218 |             allowed_prots.update(df0['dbname'])
219 |             allowed_prots.update([decoy_prefix + z for z in df0['dbname'].values])
220 | 
221 |             df0 = pd.read_table(z.replace('_proteins_full.tsv', '_PFMs_ML.tsv'), usecols=['seqs', 'qpreds'])#, 'ch', 'im'])
222 |             df0 = df0[df0['qpreds'] <= 10]
223 |             df0['seqs'] = df0['seqs']# + df0['ch'].astype(str) + df0['im'].astype(str)
224 |             allowed_peptides.update(df0['seqs'])
225 |             cnt0.update(df0['seqs'])
226 | 
227 |         custom_min_samples = 1
228 | 
229 |         allowed_peptides = set()
230 |         for k, v in cnt0.items():
231 |             if v >= custom_min_samples:
232 |                 allowed_peptides.add(k)
233 | 
234 | 
235 |         logger.info('Total number of TARGET protein GROUPS: %d', len(allowed_prots) / 2)
236 | 
237 |         sample_num = 'S1'
238 | 
239 |         if args_local.get(sample_num, 0):
240 |             for z in args_local[sample_num]:
241 |                 df3 = pd.read_table(z.replace(replace_label, '_PFMs.tsv'), usecols=['sequence', 'proteins', 'charge', 'ion_mobility'])
242 | 
243 |                 df3['tmpseq'] = df3['sequence']# + df3['charge'].astype(str) + df3['ion_mobility'].astype(str)
244 |                 df3 = df3[df3['tmpseq'].apply(lambda x: x in allowed_peptides)]
245 | 
246 |                 df3_tmp = df3[df3['proteins'].apply(lambda x: any(z in allowed_prots for z in x.split(';')))]
247 |                 for dbnames in set(df3_tmp['proteins'].values):
248 |                     for dbname in dbnames.split(';'):
249 |                         allowed_prots_all.add(dbname)
250 | 
251 |         df_final = directms1quant.get_df_final(args_local, replace_label, allowed_peptides, allowed_prots_all, pep_RT=False, RT_threshold=False)
252 | 
253 |         logger.info('Total number of peptide sequences used in quantitation: %d', len(set(df_final['origseq'])))
254 | 
255 |         cols = [z for z in df_final.columns.tolist() if not z.startswith('mz_') and not z.startswith('RT_')]
256 |         df_final = df_final[cols]
257 | 
258 |         df_final = df_final.set_index('peptide')
259 | 
260 |         all_lbls = all_s_lbls['S1']
261 | 
262 |         df_final_copy = df_final.copy()
263 | 
264 |         df_final = df_final_copy.copy()
265 | 
266 |         max_missing = len(all_lbls) - custom_min_samples
267 | 
268 |         df_final['nummissing'] = df_final.isna().sum(axis=1)
269 |         df_final['nonmissing'] = df_final['nummissing'] <= max_missing
270 | 
271 |         df_final = df_final[df_final['nonmissing']]
272 |         logger.info('Total number of PFMs: %d', len(df_final))
273 | 
274 |         out_name = path.join(ms1folder, '%s_peptide_LFQ.tsv' % (outlabel, ))
275 |         df_final.to_csv(out_name, sep='\t', index=False)
276 | 
277 | 
278 | 
279 | 
280 |     if args['start_stage'] <= 3:
281 | 
282 |         logger.info('Starting Stage 3: Prepare full LFQ protein table...')
283 | 
284 | 
285 |         all_lbls_by_batch = defaultdict(list)
286 |         all_filenames_control = set(df1[df1['group'] == control_label]['File Name'])
287 |         all_lbls_control = set()
288 | 
289 |         bdict = df1.set_index('File Name')['BatchMS'].to_dict()
290 | 
291 |         for cc in all_lbls:
292 |             try:
293 |                 origfn = names_map[path.join(infolder, cc.split('/')[-1] + replace_label)]
294 |             except:
295 |                 origfn = cc.split('_', 1)[-1] + replace_label
296 |             all_lbls_by_batch[bdict[origfn]].append(cc)
297 |             if origfn in all_filenames_control:
298 |                 all_lbls_control.add(cc)
299 |             
300 | 
301 |         for lbl_name, small_lbls in all_lbls_by_batch.items():
302 |             lbl_len = len(small_lbls)
303 |             idx_to_keep = df_final[small_lbls].isna().sum(axis=1) <= (1 - args['pep_min_non_missing_samples']) * lbl_len
304 |             df_final = df_final[idx_to_keep]
305 | 
306 |         for cc in all_lbls:
307 |             df_final[cc] = df_final[cc] / df_final[cc].nlargest(1000).sum()
308 | 
309 | 
310 | 
311 |         idx_to_keep = df_final['origseq'].apply(lambda x: x in allowed_peptides_base_only_sequences)
312 |         df_final = df_final[idx_to_keep]
313 | 
314 |         
315 |         for small_lbls in all_lbls_by_batch.values():
316 |             m = df_final[small_lbls].min(axis=1)
317 |             for col in df_final[small_lbls]:
318 |                 df_final.loc[:, col] = df_final.loc[:, col].fillna(m)
319 |                 
320 |         for lbl_key, small_lbls in all_lbls_by_batch.items():
321 | 
322 |             if args['norm'] == 2:
323 |                 m = df_final[small_lbls].median(axis=1)
324 |             elif args['norm'] == 1:
325 |                 small_lbls_control = [z for z in small_lbls if z in all_lbls_control]
326 |                 m = df_final[small_lbls_control].median(axis=1)
327 |             else:
328 |                 logger.error('norm option must be 1 or 2!')
329 |                 return -1
330 |             
331 |             
332 |             for col in df_final[small_lbls]:
333 |                 df_final.loc[:, col] = np.log2(df_final.loc[:, col] / m)
334 |                 
335 |         df_final = df_final.fillna(-10)
336 | 
337 | 
338 |         df_final = df_final.assign(protein=df_final['protein'].str.split(';')).explode('protein').reset_index(drop=True)
339 |         df_final['proteins'] = df_final['protein']
340 |         df_final = df_final.drop(columns=['protein'])
341 |         cols = [z for z in df_final.columns.tolist() if not z.startswith('mz_') and not z.startswith('RT_')]
342 |         cols.remove('proteins')
343 |         cols.insert(0, 'proteins')
344 |         df_final = df_final[cols]
345 | 
346 |         idx_to_keep = df_final.apply(lambda x: (x['origseq'], x['proteins']) in allowed_peptides_base, axis=1)
347 |         df_final = df_final[idx_to_keep].copy()
348 | 
349 | 
350 | 
351 |         # idx_to_keep = df_final['origseq'].apply(lambda x: x in allowed_peptides_up)
352 |         # idx_to_keep = df_final.apply(lambda x: (x['origseq'], x['proteins']) in allowed_peptides_up, axis=1)
353 |         # df_final_accurate = df_final[idx_to_keep].copy()
354 |         # df_final_accurate = df_final_accurate[df_final_accurate.groupby('proteins')['origseq'].transform('count') > 1]
355 |         # accurate_proteins = set(df_final_accurate['proteins'])
356 |         # df_final = pd.concat([df_final[df_final['proteins'].apply(lambda x: x not in accurate_proteins)], df_final_accurate])
357 | 
358 | 
359 |         df_final['S1_mean'] = df_final[all_lbls].median(axis=1)
360 |         df_final['intensity_median'] = df_final['S1_mean']
361 | 
362 |         df_final = df_final.sort_values(by=['nummissing', 'intensity_median'], ascending=(True, False))
363 |         df_final = df_final.drop_duplicates(subset=('origseq', 'proteins'))
364 | 
365 |         df_final = df_final[df_final['proteins'].apply(lambda x: not x.startswith('DECOY_'))]
366 | 
367 |         df_final = df_final[df_final.groupby('proteins')['S1_mean'].transform('count') > 1]
368 | 
369 |         origfnames = []
370 |         for cc in all_lbls:
371 |             origfn = cc.split('/')[-1].split('.')[0]
372 |             origfnames.append(origfn)
373 |             
374 |         dft = pd.DataFrame.from_dict([df_final.groupby('proteins')[cc].median().to_dict() for cc in all_lbls])
375 |         dft2 = pd.DataFrame.from_dict([df_final.groupby('origseq')[cc].median().to_dict() for cc in all_lbls])
376 | 
377 |         dft['File Name'] = origfnames
378 |         dft2['File Name'] = origfnames
379 | 
380 |         df1x = pd.merge(df1, dft2, on='File Name', how='left')
381 |         df1 = pd.merge(df1, dft, on='File Name', how='left')
382 | 
383 |         df1['sample+condition'] = df1['sample'].apply(lambda x: x + ' ') + df1['condition']
384 |         df1x['sample+condition'] = df1x['sample'].apply(lambda x: x + ' ') + df1x['condition']
385 | 
386 | 
387 |         out_name = path.join(ms1folder, '%s_proteins_LFQ.tsv' % (outlabel, ))
388 |         out_namex = path.join(ms1folder, '%s_peptide_LFQ_processed.tsv' % (outlabel, ))
389 |         df1.to_csv(out_name, sep='\t', index=False)
390 |         df1x.to_csv(out_namex, sep='\t', index=False)
391 | 
392 |     else:
393 | 
394 |         logger.info('Skipping Stage 3: Prepare full LFQ protein table...')
395 |         out_name = path.join(ms1folder, '%s_proteins_LFQ.tsv' % (outlabel, ))
396 |         df1 = pd.read_table(out_name)
397 | 
398 |     if args['start_stage'] <= 4:
399 | 
400 |         logger.info('Starting Stage 4: Plot figures for selected proteins...')
401 | 
402 |         warning_msg_1 = 'Provide file with proteins selected for figures.\
403 |             It should be a tsv table with dbname column. For example, it could be the standard output table of directms1quant\
404 |             Note, that protein names should be provided in uniprot format. For example, sp|P28838|AMPL_HUMAN'
405 | 
406 |         if not args['proteins_for_figure']:
407 |             logger.warning(warning_msg_1)
408 |         else:
409 | 
410 |             df_prots = pd.read_table(args['proteins_for_figure'])
411 |             if 'dbname' not in df_prots.columns:
412 |                 logger.warning('dbname column is missing in proteins file')
413 |             else:
414 | 
415 |                 allowed_proteins_for_figures = set(df_prots['dbname'])
416 | 
417 |                 for k in allowed_proteins_for_figures:
418 |                     if k not in df1.columns:
419 |                         logger.info('Protein %s was not quantified', k)
420 |                     else:
421 | 
422 | 
423 |                         try:
424 |                             gname = k.split('|')[1]
425 |                         except:
426 |                             gname = k
427 | 
428 | 
429 |                         plt.figure()
430 |                         prot_name = k
431 | 
432 |                         all_one_char_colors = ['m', 'r', 'c', 'g', 'b', 'y', ]
433 | 
434 |                         color_custom = {}
435 |                         for s_idx, s_val in enumerate(set(df1['sample'])):
436 |                             s_idx_sm = s_idx
437 |                             while s_idx_sm >= 6:
438 |                                 s_idx_sm -= 6
439 |                             color_custom[s_val] = all_one_char_colors[s_idx_sm]
440 | 
441 |                         my_pal = dict()
442 |                         for x_val, s_val in df1[['sample+condition', 'sample']].values:
443 |                             my_pal[x_val] = color_custom[s_val]
444 | 
445 |                         ax = sb.boxplot(data=df1, x='sample+condition', hue = 'sample+condition', y = prot_name, palette=my_pal, legend=False)
446 |                         xlabels_custom = [z for z in ax.get_xticklabels()]
447 |                         ax.set_xticks(ax.get_xticks())
448 |                         ax.set_xticklabels(xlabels_custom, rotation=90, size=12)
449 |                         plt.title(ax.get_ylabel())
450 |                         ax.set_ylabel('log2 LFQ', size=14)
451 |                         plt.subplots_adjust()
452 |                         plt.tight_layout()
453 |                         if args['figdir']:
454 |                             out_figdir = args['figdir']
455 |                             if not path.isdir(out_figdir):
456 |                                 makedirs(out_figdir)
457 |                         else:
458 |                             out_figdir = infolder
459 |                         plt.savefig(path.join(out_figdir, '%s_%s.png' % (args['out'], gname, )))
460 | 
461 | 
462 | if __name__ == '__main__':
463 |     run()
464 | 


--------------------------------------------------------------------------------
/ms1searchpy/group_specific.py:
--------------------------------------------------------------------------------
  1 | from .main import final_iteration, filter_results
  2 | from .utils import prot_gen
  3 | import pandas as pd
  4 | from collections import defaultdict, Counter
  5 | import argparse
  6 | import logging
  7 | import ete3
  8 | from ete3 import NCBITaxa
  9 | ncbi = NCBITaxa()
 10 | 
 11 | logger = logging.getLogger(__name__)
 12 | 
 13 | def run():
 14 |     parser = argparse.ArgumentParser(
 15 |         description='Combine DirectMS1 search results',
 16 |         epilog='''
 17 | 
 18 |     Example usage
 19 |     -------------
 20 |     $ ms1groups file1_PFMs_ML.tsv -d uniprot_shuffled.fasta -out group_statistics_by -fdr 5.0 -nproc 8 -groups genus
 21 |     -------------
 22 |     ''',
 23 |         formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 24 | 
 25 |     parser.add_argument('file', nargs='+', help='input tsv PFMs_ML files for union')
 26 |     parser.add_argument('-d', '-db', help='path to protein fasta file', required=True)
 27 |     parser.add_argument('-out', help='prefix output file names', default='group_specific_statistics_by_')
 28 |     parser.add_argument('-prots_full', help='path to any of *_proteins_full.tsv file. By default this file will be searched in the folder with PFMs_ML files', default='')
 29 |     parser.add_argument('-fdr', help='protein fdr filter in %%', default=1.0, type=float)
 30 |     parser.add_argument('-prefix', help='decoy prefix', default='DECOY_')
 31 |     parser.add_argument('-nproc', help='number of processes', default=1, type=int)
 32 |     parser.add_argument('-groups', help="dbname: To use taxonomy in swiss-prot protein name (_HUMAN, _YEAST, etc.). OX: Use OX= from fasta file. Or can be 'species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom', 'domain'", default='dbname')
 33 |     parser.add_argument('-pp', help='protein priority table for keeping protein groups when merge results by scoring', default='')
 34 |     args = vars(parser.parse_args())
 35 |     logging.basicConfig(format='%(levelname)9s: %(asctime)s %(message)s',
 36 |             datefmt='[%H:%M:%S]', level=logging.INFO)
 37 | 
 38 |     group_to_use = args['groups']
 39 |     allowed_groups = [
 40 |         'dbname', 'OX', 'species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom', 'domain'
 41 |     ]
 42 |     if group_to_use not in allowed_groups:
 43 |         logging.critical('group is not correct! Must be: %s', ','.join(allowed_groups))
 44 |         return -1
 45 | 
 46 |     dbname_map = dict()
 47 |     ox_map = dict()
 48 |     for dbinfo, dbseq in prot_gen(args):
 49 |         dbname = dbinfo.split(' ')[0]
 50 | 
 51 |         if group_to_use != 'dbname':
 52 |             try:
 53 |                 ox = dbinfo.split('OX=')[-1].split(' ')[0]
 54 |             except:
 55 |                 ox = 'Unknown'
 56 |         else:
 57 |             try:
 58 |                 ox = dbinfo.split(' ')[0].split('|')[-1].split('_')[-1]
 59 |             except:
 60 |                 ox = 'Unknown'
 61 |         dbname_map[dbname] = ox
 62 | 
 63 |     cnt = Counter(dbname_map.values())
 64 | 
 65 |     if group_to_use not in ['dbname', 'OX']:
 66 |         for ox in cnt.keys():
 67 | 
 68 |             line = ncbi.get_lineage(ox)
 69 |             ranks = ncbi.get_rank(line)
 70 |             if group_to_use not in ranks.values():
 71 |                 logger.warning('%s does not have %s', str(ox), group_to_use)
 72 |                 group_custom = 'OX:' + ox
 73 |                 # print('{} does not have {}'.format(i, group_to_use))
 74 |                 # continue
 75 | 
 76 |             else:
 77 |                 ranks_rev = {k[1]:k[0] for k in ranks.items()}
 78 |                 # print(ranks_rev)
 79 |                 group_custom = ranks_rev[group_to_use]
 80 | 
 81 |             ox_map[ox] = group_custom
 82 | 
 83 | 
 84 |         for dbname in list(dbname_map.keys()):
 85 |             dbname_map[dbname] = ox_map[dbname_map[dbname]]
 86 | 
 87 |         cnt = Counter(dbname_map.values())
 88 | 
 89 |     print(cnt.most_common())
 90 | 
 91 |     # return -1
 92 | 
 93 |     d_tmp = dict()
 94 | 
 95 |     df1 = None
 96 |     for idx, filen in enumerate(args['file']):
 97 |         logging.info('Reading file %s' % (filen, ))
 98 |         df3 = pd.read_csv(filen, sep='\t', usecols=['ids', 'qpreds', 'preds', 'decoy', 'seqs', 'proteins', 'peptide', 'iorig'])
 99 |         df3['ids'] = df3['ids'].apply(lambda x: '%d:%s' % (idx, str(x)))
100 |         df3['fidx'] = idx
101 | 
102 |         df3 = df3[df3['qpreds'] <= 10]
103 | 
104 | 
105 |         qval_ok = 0
106 |         for qval_cur in range(10):
107 |             if qval_cur != 10:
108 |                 df1ut = df3[df3['qpreds'] == qval_cur]
109 |                 decoy_ratio = df1ut['decoy'].sum() / len(df1ut)
110 |                 d_tmp[(idx, qval_cur)] = decoy_ratio
111 |                 # print(filen, qval_cur, decoy_ratio)
112 | 
113 |         if df1 is None:
114 |             df1 = df3
115 |             if args['prots_full']:
116 |                 df2 = pd.read_csv(args['prots_full'], sep='\t')
117 |             else:
118 |                 try:
119 |                     df2 = pd.read_csv(filen.replace('_PFMs_ML.tsv', '_proteins_full.tsv'), sep='\t')
120 |                 except:
121 |                     logging.critical('Proteins_full file is missing!')
122 |                     break
123 | 
124 |         else:
125 |             df1 = pd.concat([df1, df3], ignore_index=True)
126 | 
127 |     d_tmp = [z[0] for z in sorted(d_tmp.items(), key=lambda x: x[1])]
128 |     qdict = {}
129 |     for idx, val in enumerate(d_tmp):
130 |         qdict[val] = int(idx / len(args['file']))
131 |     df1['qpreds'] = df1.apply(lambda x: qdict[(x['fidx'], x['qpreds'])], axis=1)
132 | 
133 | 
134 |     pept_prot = defaultdict(set)
135 |     group_to_pep = defaultdict(set)
136 |     for seq, prots in df1[['seqs', 'proteins']].values:
137 |         for dbname in prots.split(';'):
138 |             pept_prot[seq].add(dbname)
139 |             group_to_pep[dbname_map[dbname]].add(seq)
140 | 
141 |     protsN = dict()
142 |     for dbname, theorpept in df2[['dbname', 'theoretical peptides']].values:
143 |         protsN[dbname] = theorpept
144 | 
145 | 
146 |     prefix = args['prefix']
147 |     isdecoy = lambda x: x[0].startswith(prefix)
148 |     isdecoy_key = lambda x: x.startswith(prefix)
149 |     escore = lambda x: -x[1]
150 |     fdr = float(args['fdr']) / 100
151 | 
152 |     # all_proteins = []
153 | 
154 |     base_out_name = args['out'] + group_to_use + '.tsv'
155 | 
156 |     out_dict = dict()
157 | 
158 |     for group_name in cnt:
159 | 
160 |         logging.info(group_name)
161 | 
162 |         df1_tmp = df1[df1['peptide'].apply(lambda x: x in group_to_pep[group_name])]
163 | 
164 |         protsN_tmp = dict()
165 |         for k, v in protsN.items():
166 |             dbname_map[k] == group_name
167 |             protsN_tmp[k] = v
168 | 
169 |         resdict = dict()
170 | 
171 |         resdict['qpreds'] = df1_tmp['qpreds'].values
172 |         resdict['preds'] = df1_tmp['preds'].values
173 |         resdict['seqs'] = df1_tmp['peptide'].values
174 |         resdict['ids'] = df1_tmp['ids'].values
175 |         resdict['iorig'] = df1_tmp['iorig'].values
176 | 
177 |         # mass_diff = resdict['qpreds']
178 |         # rt_diff = resdict['qpreds']
179 | 
180 |         e_ind = resdict['qpreds'] <= 9
181 |         resdict = filter_results(resdict, e_ind)
182 | 
183 |         mass_diff = resdict['qpreds']
184 |         rt_diff = resdict['qpreds']
185 | 
186 |         if args['pp']:
187 |             df4 = pd.read_table(args['pp'])
188 |             prots_spc_basic2 = df4.set_index('dbname')['score'].to_dict()
189 |         else:
190 |             prots_spc_basic2 = False
191 | 
192 | 
193 |     
194 |         top_proteins = final_iteration(resdict, mass_diff, rt_diff, pept_prot, protsN_tmp, base_out_name, prefix, isdecoy, isdecoy_key, escore, fdr, args['nproc'], prots_spc_basic2=prots_spc_basic2, output_all=False)
195 |         # all_proteins.extend(top_proteins)
196 |         out_dict[group_name] = len(top_proteins)
197 |         # print(top_proteins)
198 |         print('\n')
199 |         # break
200 |     
201 |     with open(base_out_name, 'w') as output:
202 |         output.write('taxid\tproteins\n')
203 |         for k, v in out_dict.items():
204 |             output.write('\t'.join((str(k), str(v))) + '\n')
205 | 
206 |     
207 | 
208 | if __name__ == '__main__':
209 |     run()
210 | 


--------------------------------------------------------------------------------
/ms1searchpy/main.py:
--------------------------------------------------------------------------------
   1 | import os
   2 | import numpy as np
   3 | from scipy.stats import scoreatpercentile, rankdata
   4 | from scipy.optimize import curve_fit
   5 | import operator
   6 | from copy import copy, deepcopy
   7 | from collections import defaultdict
   8 | from pyteomics import parser, mass, auxiliary as aux, achrom
   9 | try:
  10 |     from pyteomics import cmass
  11 | except ImportError:
  12 |     cmass = mass
  13 | import subprocess
  14 | import tempfile
  15 | import pandas as pd
  16 | import matplotlib
  17 | matplotlib.use('Agg')
  18 | from multiprocessing import Queue, Process, cpu_count
  19 | try:
  20 |     import seaborn
  21 |     seaborn.set(rc={'axes.facecolor':'#ffffff'})
  22 |     seaborn.set_style('whitegrid')
  23 | except:
  24 |     pass
  25 | 
  26 | from . import utils
  27 | from .utils_figures import plot_outfigures
  28 | import lightgbm as lgb
  29 | SEED = 50
  30 | import warnings
  31 | warnings.formatwarning = lambda msg, *args, **kw: str(msg) + '\n'
  32 | 
  33 | import logging
  34 | import numpy
  35 | import pandas
  36 | from sklearn import metrics
  37 | import csv
  38 | import ast
  39 | 
  40 | 
  41 | logger = logging.getLogger(__name__)
  42 | 
  43 | def worker_RT(qin, qout, shift, step, RC=False, ns=False, nr=False, win_sys=False):
  44 |     pepdict = dict()
  45 |     maxval = len(qin)
  46 |     start = 0
  47 |     while start + shift < maxval:
  48 |         item = qin[start+shift]
  49 |         pepdict[item] = achrom.calculate_RT(item, RC)
  50 |         start += step
  51 |     if win_sys:
  52 |         return pepdict
  53 |     else:
  54 |         qout.put(pepdict)
  55 |         qout.put(None)
  56 | 
  57 | 
  58 | 
  59 | 
  60 | def calc_protein_scores(p1, pept_prot, protsN, isdecoy_key, prefix, best_base_results=False, p=False):
  61 | 
  62 |     if best_base_results is not False:
  63 | 
  64 |         sorted_proteins_names = sorted(best_base_results.items(), key=lambda x: -x[1])
  65 |         sorted_proteins_names_dict = {}
  66 |         for idx, k in enumerate(sorted_proteins_names):
  67 |             sorted_proteins_names_dict[k[0]] = idx
  68 |     else:
  69 |         sorted_proteins_names_dict = False
  70 | 
  71 | 
  72 |     prots_spc2 = defaultdict(set)
  73 |     for pep, proteins in pept_prot.items():
  74 |         if pep in p1:
  75 |             if sorted_proteins_names_dict is not False:
  76 |                 best_pos = min(sorted_proteins_names_dict[k] for k in proteins)
  77 |                 for protein in proteins:
  78 |                     if sorted_proteins_names_dict[protein] == best_pos:
  79 |                         prots_spc2[protein].add(pep)
  80 |             else:
  81 |                 for protein in proteins:
  82 |                     prots_spc2[protein].add(pep)
  83 | 
  84 | 
  85 |     for k in protsN:
  86 |         if k not in prots_spc2:
  87 |             prots_spc2[k] = set([])
  88 |     prots_spc = dict((k, len(v)) for k, v in prots_spc2.items())
  89 | 
  90 |     names_arr = np.array(list(prots_spc.keys()))
  91 |     v_arr = np.array(list(prots_spc.values()))
  92 |     n_arr = np.array([protsN[k] for k in prots_spc])
  93 | 
  94 |     if p is False:
  95 |         top100decoy_score = [prots_spc.get(dprot, 0) for dprot in protsN if isdecoy_key(dprot)]
  96 |         top100decoy_N = [val for key, val in protsN.items() if isdecoy_key(key)]
  97 |         p = np.mean(top100decoy_score) / np.mean(top100decoy_N)
  98 |         logger.info('Probability of random match for theoretical peptide = %.3f', (np.mean(top100decoy_score) / np.mean(top100decoy_N)))
  99 | 
 100 | 
 101 |     prots_spc = dict()
 102 |     all_pvals = utils.calc_sf_all(v_arr, n_arr, p)
 103 |     for idx, k in enumerate(names_arr):
 104 |         prots_spc[k] = all_pvals[idx]
 105 | 
 106 |     if best_base_results is not False:
 107 |         checked = set()
 108 |         for k, v in list(prots_spc.items()):
 109 |             if k not in checked:
 110 |                 if isdecoy_key(k):
 111 |                     if prots_spc.get(k.replace(prefix, ''), -1e6) > v:
 112 |                         del prots_spc[k]
 113 |                         checked.add(k.replace(prefix, ''))
 114 |                 else:
 115 |                     if prots_spc.get(prefix + k, -1e6) > v:
 116 |                         del prots_spc[k]
 117 |                         checked.add(prefix + k)
 118 | 
 119 |     return prots_spc, p
 120 | 
 121 | 
 122 | 
 123 | 
 124 | def calibrate_RT_gaus_full(rt_diff_tmp):
 125 |     RT_left = -min(rt_diff_tmp)
 126 |     RT_right = max(rt_diff_tmp)
 127 | 
 128 |     try:
 129 |         start_width = (scoreatpercentile(rt_diff_tmp, 95) - scoreatpercentile(rt_diff_tmp, 5)) / 100
 130 |         XRT_shift, XRT_sigma, covvalue = calibrate_RT_gaus(start_width, RT_left, RT_right, rt_diff_tmp)
 131 |     except:
 132 |         start_width = (scoreatpercentile(rt_diff_tmp, 95) - scoreatpercentile(rt_diff_tmp, 5)) / 50
 133 |         XRT_shift, XRT_sigma, covvalue = calibrate_RT_gaus(start_width, RT_left, RT_right, rt_diff_tmp)
 134 |     if np.isinf(covvalue):
 135 |         XRT_shift, XRT_sigma, covvalue = calibrate_RT_gaus(0.1, RT_left, RT_right, rt_diff_tmp)
 136 |     if np.isinf(covvalue):
 137 |         XRT_shift, XRT_sigma, covvalue = calibrate_RT_gaus(1.0, RT_left, RT_right, rt_diff_tmp)
 138 |     return XRT_shift, XRT_sigma, covvalue
 139 | 
 140 | def final_iteration(resdict, mass_diff, rt_diff, pept_prot, protsN, base_out_name, prefix, isdecoy, isdecoy_key, escore, fdr, nproc, out_log=False, fname=False, prots_spc_basic2=False, output_all=True, separate_figures=False):
 141 |     n = nproc
 142 |     prots_spc_basic = dict()
 143 | 
 144 |     p1 = set(resdict['seqs'])
 145 | 
 146 |     pep_pid = defaultdict(set)
 147 |     pid_pep = defaultdict(set)
 148 |     banned_dict = dict()
 149 |     for pep, pid in zip(resdict['seqs'], resdict['ids']):
 150 | 
 151 |         pep_pid[pep].add(pid)
 152 |         pid_pep[pid].add(pep)
 153 |         if pep in banned_dict:
 154 |             banned_dict[pep] += 1
 155 |         else:
 156 |             banned_dict[pep] = 1
 157 | 
 158 |     if len(p1):
 159 |         prots_spc_final = dict()
 160 |         prots_spc_copy = False
 161 |         prots_spc2 = False
 162 |         unstable_prots = set()
 163 |         p0 = False
 164 | 
 165 |         names_arr = False
 166 |         tmp_spc_new = False
 167 |         decoy_set = False
 168 | 
 169 |         cnt_target_final = 0
 170 |         cnt_decoy_final = 0
 171 | 
 172 |         while 1:
 173 |             if not prots_spc2:
 174 | 
 175 |                 best_match_dict = dict()
 176 |                 n_map_dict = defaultdict(list)
 177 |                 for k, v in protsN.items():
 178 |                     n_map_dict[v].append(k)
 179 | 
 180 |                 decoy_set = set()
 181 |                 for k in protsN:
 182 |                     if isdecoy_key(k):
 183 |                         decoy_set.add(k)
 184 |                 # decoy_set = list(decoy_set)
 185 | 
 186 | 
 187 |                 prots_spc2 = defaultdict(set)
 188 |                 for pep, proteins in pept_prot.items():
 189 |                     if pep in p1:
 190 |                         for protein in proteins:
 191 |                             prots_spc2[protein].add(pep)
 192 | 
 193 |                 for k in protsN:
 194 |                     if k not in prots_spc2:
 195 |                         prots_spc2[k] = set([])
 196 |                 prots_spc2 = dict(prots_spc2)
 197 |                 unstable_prots = set(prots_spc2.keys())
 198 | 
 199 |                 top100decoy_N = sum([val for key, val in protsN.items() if isdecoy_key(key)])
 200 | 
 201 |                 names_arr = np.array(list(prots_spc2.keys()))
 202 |                 n_arr = np.array([protsN[k] for k in names_arr])
 203 | 
 204 |                 tmp_spc_new = dict((k, len(v)) for k, v in prots_spc2.items())
 205 | 
 206 | 
 207 |                 # top100decoy_score_tmp = [tmp_spc_new.get(dprot, 0) for dprot in decoy_set]
 208 |                 # top100decoy_score_tmp_sum = float(sum(top100decoy_score_tmp))
 209 |                 top100decoy_score_tmp = {dprot: tmp_spc_new.get(dprot, 0) for dprot in decoy_set}
 210 |                 top100decoy_score_tmp_sum = float(sum(top100decoy_score_tmp.values()))
 211 | 
 212 |             prots_spc = tmp_spc_new
 213 |             if not prots_spc_copy:
 214 |                 prots_spc_copy = deepcopy(prots_spc)
 215 | 
 216 |             for v in decoy_set.intersection(unstable_prots):
 217 |                 top100decoy_score_tmp_sum -= top100decoy_score_tmp[v]
 218 |                 top100decoy_score_tmp[v] = prots_spc.get(v, 0)
 219 |                 top100decoy_score_tmp_sum += top100decoy_score_tmp[v]
 220 |             p = top100decoy_score_tmp_sum / top100decoy_N
 221 | 
 222 |             n_change = set(protsN[k] for k in unstable_prots)
 223 |             if len(best_match_dict) == 0:
 224 |                 n_change = sorted(n_change)
 225 |             for n_val in n_change:
 226 |                 for k in n_map_dict[n_val]:
 227 |                     v = prots_spc[k]
 228 |                     if n_val not in best_match_dict or v > prots_spc[best_match_dict[n_val]]:
 229 |                         best_match_dict[n_val] = k
 230 |             n_arr_small = []
 231 |             names_arr_small = []
 232 |             v_arr_small = []
 233 |             # for k, v in best_match_dict.items():
 234 |             #     n_arr_small.append(k)
 235 |             #     names_arr_small.append(v)
 236 |             #     v_arr_small.append(prots_spc[v])
 237 | 
 238 | 
 239 | 
 240 |             max_k = 0
 241 | 
 242 |             for k, v in best_match_dict.items():
 243 |                 num_matched = prots_spc[v]
 244 |                 if num_matched >= max_k:
 245 | 
 246 |                     max_k = num_matched
 247 | 
 248 |                     n_arr_small.append(k)
 249 |                     names_arr_small.append(v)
 250 |                     v_arr_small.append(prots_spc[v])
 251 | 
 252 | 
 253 |             prots_spc_basic = dict()
 254 |             all_pvals = utils.calc_sf_all(np.array(v_arr_small), n_arr_small, p)
 255 |             for idx, k in enumerate(names_arr_small):
 256 |                 prots_spc_basic[k] = all_pvals[idx]
 257 | 
 258 |             if not p0:
 259 |                 p0 = float(p)
 260 | 
 261 |                 prots_spc_tmp = dict()
 262 |                 v_arr = np.array([prots_spc[k] for k in names_arr])
 263 |                 all_pvals = utils.calc_sf_all(v_arr, n_arr, p)
 264 |                 for idx, k in enumerate(names_arr):
 265 |                     prots_spc_tmp[k] = all_pvals[idx]
 266 | 
 267 |                 sortedlist_spc = sorted(prots_spc_tmp.items(), key=operator.itemgetter(1))[::-1]
 268 |                 if output_all:
 269 |                     with open(base_out_name + '_proteins_full_noexclusion.tsv', 'w') as output:
 270 |                         output.write('dbname\tscore\tmatched peptides\ttheoretical peptides\n')
 271 |                         for x in sortedlist_spc:
 272 |                             output.write('\t'.join((x[0], str(x[1]), str(prots_spc_copy[x[0]]), str(protsN[x[0]]))) + '\n')
 273 | 
 274 |             best_prot = utils.keywithmaxval(prots_spc_basic)
 275 | 
 276 |             best_score = prots_spc_basic[best_prot]
 277 |             unstable_prots = set()
 278 |             if best_prot not in prots_spc_final:
 279 |                 prots_spc_final[best_prot] = best_score
 280 |                 banned_pids = set()
 281 |                 for pep in prots_spc2[best_prot]:
 282 |                     for pid in pep_pid[pep]:
 283 |                         banned_pids.add(pid)
 284 |                 for pid in banned_pids:
 285 |                     for pep in pid_pep[pid]:
 286 |                         banned_dict[pep] -= 1
 287 |                         if banned_dict[pep] == 0:
 288 |                             for bprot in pept_prot[pep]:
 289 |                                 tmp_spc_new[bprot] -= 1
 290 |                                 unstable_prots.add(bprot)
 291 | 
 292 | 
 293 |                 if best_prot in decoy_set:
 294 |                     cnt_decoy_final += 1
 295 |                 else:
 296 |                     cnt_target_final += 1
 297 | 
 298 | 
 299 |             else:
 300 | 
 301 |                 v_arr = np.array([prots_spc[k] for k in names_arr])
 302 |                 all_pvals = utils.calc_sf_all(v_arr, n_arr, p)
 303 |                 for idx, k in enumerate(names_arr):
 304 |                     prots_spc_basic[k] = all_pvals[idx]
 305 | 
 306 |                 for k, v in prots_spc_basic.items():
 307 |                     if k not in prots_spc_final:
 308 |                         prots_spc_final[k] = v
 309 | 
 310 |                 break
 311 |             try:
 312 |                 prot_fdr = cnt_decoy_final / cnt_target_final
 313 |                 # prot_fdr = aux.fdr(prots_spc_final.items(), is_decoy=isdecoy)
 314 |             except:
 315 |                 prot_fdr = 100.0
 316 |             if prot_fdr >= 12.5 * fdr:
 317 | 
 318 |                 v_arr = np.array([prots_spc[k] for k in names_arr])
 319 |                 all_pvals = utils.calc_sf_all(v_arr, n_arr, p)
 320 |                 for idx, k in enumerate(names_arr):
 321 |                     prots_spc_basic[k] = all_pvals[idx]
 322 | 
 323 |                 for k, v in prots_spc_basic.items():
 324 |                     if k not in prots_spc_final:
 325 |                         prots_spc_final[k] = v
 326 |                 break
 327 | 
 328 |     if prots_spc_basic2 is False:
 329 |         prots_spc_basic2 = copy(prots_spc_final)
 330 |     else:
 331 |         prots_spc_basic2 = prots_spc_basic2
 332 |         for k in prots_spc_final:
 333 |             if k not in prots_spc_basic2:
 334 |                 prots_spc_basic2[k] = 0
 335 |     prots_spc_final = dict()
 336 | 
 337 |     if n == 0:
 338 |         try:
 339 |             n = cpu_count()
 340 |         except NotImplementedError:
 341 |             n = 1
 342 | 
 343 |     if n == 1 or os.name == 'nt':
 344 |         qin = []
 345 |         qout = []
 346 |         for mass_koef in range(10):
 347 |             rtt_koef = mass_koef
 348 |             qin.append((mass_koef, rtt_koef))
 349 |         qout = worker(qin, qout, mass_diff, rt_diff, resdict, protsN, pept_prot, isdecoy_key, isdecoy, fdr, prots_spc_basic2, True)
 350 | 
 351 |         for item, item2 in qout:
 352 |             if item2:
 353 |                 prots_spc_copy = item2
 354 |             for k in protsN:
 355 |                 if k not in prots_spc_final:
 356 |                     prots_spc_final[k] = [item.get(k, 0.0), ]
 357 |                 else:
 358 |                     prots_spc_final[k].append(item.get(k, 0.0))
 359 | 
 360 |     else:
 361 |         qin = Queue()
 362 |         qout = Queue()
 363 | 
 364 |         for mass_koef in range(10):
 365 |             rtt_koef = mass_koef
 366 |             qin.put((mass_koef, rtt_koef))
 367 | 
 368 |         for _ in range(n):
 369 |             qin.put(None)
 370 | 
 371 |         procs = []
 372 |         for proc_num in range(n):
 373 |             p = Process(target=worker, args=(qin, qout, mass_diff, rt_diff, resdict, protsN, pept_prot, isdecoy_key, isdecoy, fdr, prots_spc_basic2))
 374 |             p.start()
 375 |             procs.append(p)
 376 | 
 377 |         for _ in range(n):
 378 |             for item, item2 in iter(qout.get, None):
 379 |                 if item2:
 380 |                     prots_spc_copy = item2
 381 |                 for k in protsN:
 382 |                     if k not in prots_spc_final:
 383 |                         prots_spc_final[k] = [item.get(k, 0.0), ]
 384 |                     else:
 385 |                         prots_spc_final[k].append(item.get(k, 0.0))
 386 | 
 387 |         for p in procs:
 388 |             p.join()
 389 | 
 390 |     for k in prots_spc_final.keys():
 391 |         prots_spc_final[k] = np.mean(prots_spc_final[k])
 392 | 
 393 |     prots_spc = deepcopy(prots_spc_final)
 394 |     sortedlist_spc = sorted(prots_spc.items(), key=operator.itemgetter(1))[::-1]
 395 |     if output_all:
 396 |         with open(base_out_name + '_proteins_full.tsv', 'w') as output:
 397 |             output.write('dbname\tscore\tmatched peptides\ttheoretical peptides\tdecoy\n')
 398 |             for x in sortedlist_spc:
 399 |                 output.write('\t'.join((x[0], str(x[1]), str(prots_spc_copy[x[0]]), str(protsN[x[0]]), str(isdecoy(x)))) + '\n')
 400 | 
 401 |     checked = set()
 402 |     for k, v in list(prots_spc.items()):
 403 |         if k not in checked:
 404 |             if isdecoy_key(k):
 405 |                 if prots_spc.get(k.replace(prefix, ''), -1e6) > v:
 406 |                     del prots_spc[k]
 407 |                     checked.add(k.replace(prefix, ''))
 408 |             else:
 409 |                 if prots_spc.get(prefix + k, -1e6) > v:
 410 |                     del prots_spc[k]
 411 |                     checked.add(prefix + k)
 412 | 
 413 |     filtered_prots = aux.filter(prots_spc.items(), fdr=fdr, key=escore, is_decoy=isdecoy, remove_decoy=True, formula=1, full_output=True, correction=1)
 414 |     if len(filtered_prots) < 1:
 415 |         filtered_prots = aux.filter(prots_spc.items(), fdr=fdr, key=escore, is_decoy=isdecoy, remove_decoy=True, formula=1, full_output=True, correction=0)
 416 |     identified_proteins = 0
 417 | 
 418 |     for x in filtered_prots:
 419 |         identified_proteins += 1
 420 | 
 421 |     logger.info('TOP 5 identified proteins:')
 422 |     logger.info('dbname\tscore\tmatched peptides\ttheoretical peptides')
 423 |     for x in filtered_prots[:5]:
 424 |         logger.info('\t'.join((str(x[0]), str(x[1]), str(int(prots_spc_copy[x[0]])), str(protsN[x[0]]))))
 425 |     logger.info('Final stage search: identified proteins = %d', identified_proteins)
 426 | 
 427 |     if output_all:
 428 |         with open(base_out_name + '_proteins.tsv', 'w') as output:
 429 |             output.write('dbname\tscore\tmatched peptides\ttheoretical peptides\tdecoy\n')
 430 |             for x in filtered_prots:
 431 |                 output.write('\t'.join((x[0], str(x[1]), str(prots_spc_copy[x[0]]), str(protsN[x[0]]), str(isdecoy(x)))) + '\n')
 432 | 
 433 |         if out_log or (fname and identified_proteins > 10):
 434 |             df1_peptides = pd.read_table(base_out_name + '_PFMs.tsv')
 435 |             df1_peptides['decoy'] = df1_peptides['proteins'].apply(lambda x: any(isdecoy_key(z) for z in x.split(';')))
 436 | 
 437 |             df1_proteins = pd.read_table(base_out_name + '_proteins_full.tsv')
 438 |             df1_proteins_f = pd.read_table(base_out_name + '_proteins.tsv')
 439 |             top_proteins = set(df1_proteins_f['dbname'])
 440 |             df1_peptides_f = df1_peptides[df1_peptides['proteins'].apply(lambda x: any(z in top_proteins for z in x.split(';')))]
 441 | 
 442 |             if fname and identified_proteins > 10:
 443 | 
 444 |                 df0 = pd.read_table(os.path.splitext(fname)[0] + '.tsv')
 445 | 
 446 |                 plot_outfigures(df0, df1_peptides, df1_peptides_f,
 447 |                     base_out_name, df_proteins=df1_proteins,
 448 |                     df_proteins_f=df1_proteins_f, prefix=prefix, separate_figures=separate_figures)
 449 | 
 450 |             if out_log is not False:
 451 |                 df1_peptides_f = df1_peptides_f.sort_values(by='Intensity')
 452 |                 df1_peptides_f = df1_peptides_f.drop_duplicates(subset='sequence', keep='last')
 453 |                 dynamic_range_estimation = np.log10(df1_peptides_f['Intensity'].quantile(0.99)) - np.log10(df1_peptides_f['Intensity'].quantile(0.01))
 454 |                 out_log.write('Estimated dynamic range in Log10 scale: %.1f\n' % (dynamic_range_estimation, ))
 455 |                 out_log.write('Matched peptides for top-scored proteins: %d\n' % (len(df1_peptides_f), ))
 456 |                 out_log.write('Identified proteins: %d\n' % (identified_proteins, ))
 457 |                 out_log.close()
 458 | 
 459 |             return top_proteins
 460 | 
 461 |     else:
 462 |         top_proteins_fullinfo = []
 463 |         for x in filtered_prots:
 464 |             top_proteins_fullinfo.append((x[0], str(x[1]), str(prots_spc_copy[x[0]]), str(protsN[x[0]]), str(isdecoy(x))))
 465 | 
 466 | 
 467 |         return top_proteins_fullinfo
 468 | 
 469 | def noisygaus(x, a, x0, sigma, b):
 470 |     return a * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2)) + b
 471 | 
 472 | def calibrate_mass(bwidth, mass_left, mass_right, true_md):
 473 | 
 474 |     bbins = np.arange(-mass_left, mass_right, bwidth)
 475 |     H1, b1 = np.histogram(true_md, bins=bbins)
 476 |     b1 = b1 + bwidth
 477 |     b1 = b1[:-1]
 478 | 
 479 | 
 480 |     popt, pcov = curve_fit(noisygaus, b1, H1, p0=[1, np.median(true_md), 1, 1])
 481 |     mass_shift, mass_sigma = popt[1], abs(popt[2])
 482 |     return mass_shift, mass_sigma, pcov[0][0]
 483 | 
 484 | def calibrate_RT_gaus(bwidth, mass_left, mass_right, true_md):
 485 | 
 486 |     bbins = np.arange(-mass_left, mass_right, bwidth)
 487 |     H1, b1 = np.histogram(true_md, bins=bbins)
 488 |     b1 = b1 + bwidth
 489 |     b1 = b1[:-1]
 490 | 
 491 | 
 492 |     popt, pcov = curve_fit(noisygaus, b1, H1, p0=[1, np.median(true_md), bwidth * 5, 1])
 493 |     mass_shift, mass_sigma = popt[1], abs(popt[2])
 494 |     return mass_shift, mass_sigma, pcov[0][0]
 495 | 
 496 | def process_file(args):
 497 |     utils.seen_target.clear()
 498 |     utils.seen_decoy.clear()
 499 |     args = utils.prepare_decoy_db(args)
 500 |     for filename in args['files']:
 501 | 
 502 |         # Temporary for pyteomics <= Version 4.5.5 bug
 503 |         from pyteomics import mass
 504 |         if 'H-' in mass.std_aa_mass:
 505 |             del mass.std_aa_mass['H-']
 506 |         if '-OH' in mass.std_aa_mass:
 507 |             del mass.std_aa_mass['-OH']
 508 | 
 509 |         try:
 510 |             args['file'] = filename
 511 |             process_peptides(deepcopy(args))
 512 |         except Exception as e:
 513 |             logger.error(e)
 514 |             logger.error('Search is failed for file: %s', filename)
 515 |     return 1
 516 | 
 517 | 
 518 | 
 519 | def prepare_peptide_processor(fname, args):
 520 |     global nmasses
 521 |     global rts
 522 |     global charges
 523 |     global ids
 524 |     global Is
 525 |     global Isums
 526 |     global Scans
 527 |     global Isotopes
 528 |     global mzraw
 529 |     global avraw
 530 |     global imraw
 531 | 
 532 |     min_ch = args['cmin']
 533 |     max_ch = args['cmax']
 534 | 
 535 |     min_isotopes = args['i']
 536 |     min_scans = args['sc']
 537 | 
 538 |     logger.info('Reading file %s', fname)
 539 | 
 540 |     df_features = utils.iterate_spectra(fname, min_ch, max_ch, min_isotopes, min_scans, args['nproc'], args['check_unique'])
 541 | 
 542 |     # Sort by neutral mass
 543 |     df_features = df_features.sort_values(by='massCalib')
 544 | 
 545 |     nmasses = df_features['massCalib'].values
 546 |     rts = df_features['rtApex'].values
 547 |     charges = df_features['charge'].values
 548 |     ids = df_features['id'].values
 549 |     Is = df_features['intensityApex'].values
 550 |     if 'intensitySum' in df_features.columns:
 551 |         Isums = df_features['intensitySum'].values
 552 |     else:
 553 |         Isums = df_features['intensityApex'].values
 554 |         logger.info('intensitySum column is missing in peptide features. Using intensityApex instead')
 555 | 
 556 |     Scans = df_features['nScans'].values
 557 |     Isotopes = df_features['nIsotopes'].values
 558 |     mzraw = df_features['mz'].values
 559 |     avraw = np.zeros(len(df_features))
 560 |     if len(set(df_features['FAIMS'])) > 1:
 561 |         imraw = df_features['FAIMS'].values
 562 |     else:
 563 |         imraw = df_features['im'].values
 564 | 
 565 |     logger.info('Number of peptide isotopic clusters passed filters: %d', len(nmasses))
 566 | 
 567 |     aa_mass, aa_to_psi = utils.get_aa_mass_with_fixed_mods(args['fmods'], args['fmods_legend'])
 568 | 
 569 |     acc_l = args['ptol']
 570 |     acc_r = args['ptol']
 571 | 
 572 |     return {'aa_mass': aa_mass, 'aa_to_psi': aa_to_psi, 'acc_l': acc_l, 'acc_r': acc_r, 'args': args}, df_features
 573 | 
 574 | def get_resdict(it, **kwargs):
 575 | 
 576 |     resdict = {
 577 |         'seqs': [],
 578 |         'md': [],
 579 |         'mods': [],
 580 |         'iorig': [],
 581 |     }
 582 | 
 583 |     for seqm in it:
 584 |         results = []
 585 |         m = cmass.fast_mass(seqm, aa_mass=kwargs['aa_mass']) + kwargs['aa_mass'].get('Nterm', 0) + kwargs['aa_mass'].get('Cterm', 0)
 586 |         acc_l = kwargs['acc_l']
 587 |         acc_r = kwargs['acc_r']
 588 |         dm_l = acc_l * m / 1.0e6
 589 |         if acc_r == acc_l:
 590 |             dm_r = dm_l
 591 |         else:
 592 |             dm_r = acc_r * m / 1.0e6
 593 |         start = nmasses.searchsorted(m - dm_l)
 594 |         end = nmasses.searchsorted(m + dm_r, side='right')
 595 |         for i in range(start, end):
 596 |             massdiff = (m - nmasses[i]) / m * 1e6
 597 |             mods = 0
 598 | 
 599 |             resdict['seqs'].append(seqm)
 600 |             resdict['md'].append(massdiff)
 601 |             resdict['mods'].append(mods)
 602 |             resdict['iorig'].append(i)
 603 | 
 604 | 
 605 |     for k in list(resdict.keys()):
 606 |         resdict[k] = np.array(resdict[k])
 607 | 
 608 |     return resdict
 609 | 
 610 | 
 611 | 
 612 | def filter_results(resultdict, idx):
 613 |     tmp = dict()
 614 |     for label in resultdict:
 615 |         tmp[label] = resultdict[label][idx]
 616 |     return tmp
 617 | 
 618 | def process_peptides(args):
 619 |     logger.info('Starting search...')
 620 | 
 621 |     fname_orig = args['file']
 622 |     if fname_orig.lower().endswith('mzml'):
 623 |         fname = os.path.splitext(fname_orig)[0] + '.features.tsv'
 624 |     else:
 625 |         fname = fname_orig
 626 | 
 627 |     fdr = args['fdr'] / 100
 628 |     min_isotopes_calibration = args['ci']
 629 |     min_scans_calibration = args['csc']
 630 |     try:
 631 |         outpath = args['o']
 632 |     except:
 633 |         outpath = False
 634 | 
 635 | 
 636 |     if outpath:
 637 |         base_out_name = os.path.splitext(os.path.join(outpath, os.path.basename(fname)))[0]
 638 |     else:
 639 |         base_out_name = os.path.splitext(fname)[0]
 640 | 
 641 |     out_log = open(base_out_name + '_log.txt', 'w')
 642 |     out_log.close()
 643 |     out_log = open(base_out_name + '_log.txt', 'w')
 644 |     out_log.write('Filename: %s\n' % (fname, ))
 645 |     out_log.write('Used parameters: %s\n' % (args.items(), ))
 646 | 
 647 |     deeplc_path = args['deeplc']
 648 |     if deeplc_path:
 649 |         from deeplc import DeepLC
 650 |         logging.getLogger('deeplc').setLevel(logging.ERROR)
 651 | 
 652 |     deeplc_model_path = args['deeplc_model_path']
 653 |     deeplc_model_path = deeplc_model_path.strip()
 654 | 
 655 |     if len(deeplc_model_path) > 0:
 656 |         if deeplc_model_path.endswith('.hdf5'):
 657 |             path_model = deeplc_model_path
 658 |         else:
 659 |             path_model = [os.path.join(deeplc_model_path,f) for f in os.listdir(deeplc_model_path) if f.endswith(".hdf5")]
 660 | 
 661 |     else:
 662 |         path_model = None
 663 | 
 664 |     if args['deeplc_library']:
 665 | 
 666 |         path_to_lib = args['deeplc_library']
 667 |         if not os.path.isfile(path_to_lib):
 668 |             lib_file = open(path_to_lib, 'w')
 669 |             lib_file.close()
 670 |         write_library = True
 671 | 
 672 |     else:
 673 |         path_to_lib = None
 674 |         write_library = False
 675 | 
 676 | 
 677 | 
 678 |     calib_path = args['pl']
 679 |     calib_path = calib_path.strip()
 680 | 
 681 |     if calib_path and args['ts']:
 682 |         args['ts'] = 0
 683 |         logger.info('Two-stage RT prediction does not work with list of MS/MS identified peptides...')
 684 | 
 685 |     args['enzyme'] = utils.get_enzyme(args['e'])
 686 | 
 687 | 
 688 |     prefix = args['prefix']
 689 |     protsN, pept_prot, ml_correction = utils.get_prot_pept_map(args)
 690 | 
 691 |     kwargs, df_features = prepare_peptide_processor(fname_orig, args)
 692 |     logger.info('Running the search ...')
 693 |     out_log.write('Number of features: %d\n' % (len(df_features), ))
 694 | 
 695 |     resdict = get_resdict(pept_prot, **kwargs)
 696 | 
 697 |     aa_to_psi = kwargs['aa_to_psi']
 698 | 
 699 |     if args['mc'] > 0:
 700 |         resdict['mc'] = np.array([parser.num_sites(z, args['enzyme']) for z in resdict['seqs']])
 701 | 
 702 |     isdecoy = lambda x: x[0].startswith(prefix)
 703 |     isdecoy_key = lambda x: x.startswith(prefix)
 704 |     escore = lambda x: -x[1]
 705 | 
 706 | 
 707 | 
 708 |     e_ind = np.array([Isotopes[iorig] for iorig in resdict['iorig']]) >= min_isotopes_calibration
 709 |     resdict2 = filter_results(resdict, e_ind)
 710 | 
 711 |     e_ind = np.array([Scans[iorig] for iorig in resdict2['iorig']]) >= min_scans_calibration
 712 |     resdict2 = filter_results(resdict2, e_ind)
 713 | 
 714 |     e_ind = resdict2['mods'] == 0
 715 |     resdict2 = filter_results(resdict2, e_ind)
 716 | 
 717 |     if args['mc'] > 0:
 718 |         e_ind = resdict2['mc'] == 0
 719 |         resdict2 = filter_results(resdict2, e_ind)
 720 | 
 721 |     p1 = set(resdict2['seqs'])
 722 | 
 723 |     if len(p1):
 724 | 
 725 |         # Calculate basic protein scores including homologues
 726 |         prots_spc, p = calc_protein_scores(p1, pept_prot, protsN, isdecoy_key, prefix, best_base_results=False, p=False)
 727 |         # Calculate basic protein scores excluding homologues
 728 |         prots_spc, p = calc_protein_scores(p1, pept_prot, protsN, isdecoy_key, prefix, best_base_results=prots_spc, p=p)
 729 | 
 730 | 
 731 | 
 732 | 
 733 |         # prots_spc2 = defaultdict(set)
 734 |         # for pep, proteins in pept_prot.items():
 735 |         #     if pep in p1:
 736 |         #         for protein in proteins:
 737 |         #             prots_spc2[protein].add(pep)
 738 | 
 739 |         # for k in protsN:
 740 |         #     if k not in prots_spc2:
 741 |         #         prots_spc2[k] = set([])
 742 |         # prots_spc = dict((k, len(v)) for k, v in prots_spc2.items())
 743 | 
 744 |         # names_arr = np.array(list(prots_spc.keys()))
 745 |         # v_arr = np.array(list(prots_spc.values()))
 746 |         # n_arr = np.array([protsN[k] for k in prots_spc])
 747 | 
 748 |         # top100decoy_score = [prots_spc.get(dprot, 0) for dprot in protsN if isdecoy_key(dprot)]
 749 |         # top100decoy_N = [val for key, val in protsN.items() if isdecoy_key(key)]
 750 |         # p = np.mean(top100decoy_score) / np.mean(top100decoy_N)
 751 |         # logger.info('Stage 0 search: probability of random match for theoretical peptide = %.3f', (np.mean(top100decoy_score) / np.mean(top100decoy_N)))
 752 | 
 753 |         # prots_spc = dict()
 754 |         # all_pvals = utils.calc_sf_all(v_arr, n_arr, p)
 755 |         # for idx, k in enumerate(names_arr):
 756 |         #     prots_spc[k] = all_pvals[idx]
 757 | 
 758 |         # # checked = set()
 759 |         # # for k, v in list(prots_spc.items()):
 760 |         # #     if k not in checked:
 761 |         # #         if isdecoy_key(k):
 762 |         # #             if prots_spc.get(k.replace(prefix, ''), -1e6) > v:
 763 |         # #                 del prots_spc[k]
 764 |         # #                 checked.add(k.replace(prefix, ''))
 765 |         # #         else:
 766 |         # #             if prots_spc.get(prefix + k, -1e6) > v:
 767 |         # #                 del prots_spc[k]
 768 |         # #                 checked.add(prefix + k)
 769 | 
 770 | 
 771 | 
 772 | 
 773 | 
 774 |         # for k in protsN:
 775 |         #     if k not in prots_spc:
 776 |         #         prots_spc[k] = 0
 777 | 
 778 |         # sorted_proteins_names = sorted(prots_spc.items(), key=lambda x: -x[1])
 779 |         # sorted_proteins_names_dict = {}
 780 |         # for idx, k in enumerate(sorted_proteins_names):
 781 |         #     sorted_proteins_names_dict[k[0]] = idx
 782 | 
 783 |         # prots_spc2 = defaultdict(set)
 784 |         # for pep, proteins in pept_prot.items():
 785 |         #     if pep in p1:
 786 |         #         best_pos = min(sorted_proteins_names_dict[k] for k in proteins)
 787 |         #         for protein in proteins:
 788 |         #             if sorted_proteins_names_dict[protein] == best_pos:
 789 |         #                 prots_spc2[protein].add(pep)
 790 | 
 791 |         # for k in protsN:
 792 |         #     if k not in prots_spc2:
 793 |         #         prots_spc2[k] = set([])
 794 |         # prots_spc = dict((k, len(v)) for k, v in prots_spc2.items())
 795 | 
 796 |         # names_arr = np.array(list(prots_spc.keys()))
 797 |         # v_arr = np.array(list(prots_spc.values()))
 798 |         # n_arr = np.array([protsN[k] for k in prots_spc])
 799 | 
 800 |         # prots_spc = dict()
 801 |         # all_pvals = utils.calc_sf_all(v_arr, n_arr, p)
 802 |         # for idx, k in enumerate(names_arr):
 803 |         #     prots_spc[k] = all_pvals[idx]
 804 | 
 805 |         # checked = set()
 806 |         # for k, v in list(prots_spc.items()):
 807 |         #     if k not in checked:
 808 |         #         if isdecoy_key(k):
 809 |         #             if prots_spc.get(k.replace(prefix, ''), -1e6) > v:
 810 |         #                 del prots_spc[k]
 811 |         #                 checked.add(k.replace(prefix, ''))
 812 |         #         else:
 813 |         #             if prots_spc.get(prefix + k, -1e6) > v:
 814 |         #                 del prots_spc[k]
 815 |         #                 checked.add(prefix + k)
 816 | 
 817 | 
 818 | 
 819 | 
 820 | 
 821 | 
 822 | 
 823 | 
 824 | 
 825 | 
 826 | 
 827 | 
 828 | 
 829 | 
 830 | 
 831 | 
 832 | 
 833 | 
 834 | 
 835 |         filtered_prots = aux.filter(prots_spc.items(), fdr=0.05, key=escore, is_decoy=isdecoy, remove_decoy=True, formula=1,
 836 |                                     full_output=True)
 837 | 
 838 |         identified_proteins = 0
 839 | 
 840 |         for x in filtered_prots:
 841 |             identified_proteins += 1
 842 |         logger.info('Stage 0 search: identified proteins = %d', identified_proteins)
 843 |         if identified_proteins <= 25:
 844 |             logger.info('Low number of identified proteins, using first 25 top scored proteins for calibration...')
 845 |             filtered_prots = sorted(prots_spc.items(), key=lambda x: -x[1])[:25]
 846 | 
 847 |         logger.info('Running mass recalibration...')
 848 | 
 849 |         df1 = pd.DataFrame()
 850 |         df1['mass diff'] = resdict['md']
 851 |         df1['mc'] = (resdict['mc'] if args['mc'] > 0 else 0)
 852 |         df1['iorig'] = resdict['iorig']
 853 |         df1['seqs'] = resdict['seqs']
 854 | 
 855 |         df1['mods'] = resdict['mods']
 856 | 
 857 |         df1['nIsotopes'] = [Isotopes[iorig] for iorig in df1['iorig'].values]
 858 |         df1['nScans'] = [Scans[iorig] for iorig in df1['iorig'].values]
 859 | 
 860 |         # df1['orig_md'] = true_md
 861 | 
 862 | 
 863 |         true_seqs = set()
 864 |         true_prots = set(x[0] for x in filtered_prots)
 865 |         for pep, proteins in pept_prot.items():
 866 |             if any(protein in true_prots for protein in proteins):
 867 |                 true_seqs.add(pep)
 868 | 
 869 |         df1['top_peps'] = (df1['mc'] == 0) & (df1['seqs'].apply(lambda x: x in true_seqs) & (df1['nIsotopes'] >= min_isotopes_calibration) & (df1['nScans'] >= min_scans_calibration))
 870 | 
 871 |         mass_calib_arg = args['mcalib']
 872 | 
 873 |         assert mass_calib_arg in [0, 1, 2]
 874 | 
 875 |         if mass_calib_arg:
 876 |             df1['RT'] = rts[df1['iorig'].values]#df1['iorig'].apply(lambda x: rts[x])
 877 | 
 878 |             if mass_calib_arg == 2:
 879 |                 df1['im'] = imraw[df1['iorig'].values]#df1['iorig'].apply(lambda x: imraw[x])
 880 |             elif mass_calib_arg == 1:
 881 |                 df1['im'] = 0
 882 | 
 883 |             im_set = set(df1['im'].unique())
 884 |             if len(im_set) <= 5:
 885 |                 df1['im_qcut'] = df1['im']
 886 |                 for im_value in im_set:
 887 |                     idx1 = df1['im'] == im_value
 888 |                     df1.loc[idx1, 'qpreds'] = str(im_value) + pd.qcut(df1.loc[idx1, 'RT'], 5, labels=range(5)).astype(str)
 889 |             else:
 890 |                 df1['im_qcut'] = pd.qcut(df1['im'], 5, labels=range(5)).astype(str)
 891 |                 im_set = set(df1['im_qcut'].unique())
 892 |                 for im_value in set(df1['im_qcut'].unique()):
 893 |                     idx1 = df1['im_qcut'] == im_value
 894 |                     df1.loc[idx1, 'qpreds'] = str(im_value) + pd.qcut(df1.loc[idx1, 'RT'], 5, labels=range(5)).astype(str)
 895 | 
 896 |             # df1['qpreds'] = pd.qcut(df1['RT'], 10, labels=range(10))#.astype(int)
 897 | 
 898 |             cor_dict = df1[df1['top_peps']].groupby('qpreds')['mass diff'].median().to_dict()
 899 | 
 900 |             rt_q_list = list(range(5))
 901 |             for im_value in im_set:
 902 |                 for rt_q in rt_q_list:
 903 |                     lbl_cur = str(im_value) + str(rt_q)
 904 |                     if lbl_cur not in cor_dict:
 905 | 
 906 |                         best_diff = 1e6
 907 |                         best_val = 0
 908 |                         for rt_q2 in rt_q_list:
 909 |                             cur_diff = abs(rt_q - rt_q2)
 910 |                             if cur_diff != 0:
 911 |                                 lbl_cur2 = str(im_value) + str(rt_q2)
 912 |                                 if lbl_cur2 in cor_dict:
 913 |                                     if cur_diff < best_diff:
 914 |                                         best_diff = cur_diff
 915 |                                         best_val = cor_dict[lbl_cur2]
 916 | 
 917 |                         cor_dict[lbl_cur] = best_val
 918 | 
 919 |             df1['mass diff q median'] = df1['qpreds'].apply(lambda x: cor_dict[x])
 920 |             df1['mass diff corrected'] = df1['mass diff'] - df1['mass diff q median']
 921 | 
 922 |         else:
 923 |             df1['qpreds'] = 0
 924 |             df1['mass diff q median'] = 0
 925 |             df1['mass diff corrected'] = df1['mass diff']
 926 | 
 927 | 
 928 | 
 929 | 
 930 |         mass_left = args['ptol']
 931 |         mass_right = args['ptol']
 932 | 
 933 |         try:
 934 |             mass_shift_cor, mass_sigma_cor, covvalue_cor = calibrate_mass(0.001, mass_left, mass_right, df1[df1['top_peps']]['mass diff corrected'])
 935 |         except:
 936 |             mass_shift_cor, mass_sigma_cor, covvalue_cor = calibrate_mass(0.01, mass_left, mass_right, df1[df1['top_peps']]['mass diff corrected'])
 937 | 
 938 |         if mass_calib_arg:
 939 | 
 940 |             try:
 941 |                 mass_shift, mass_sigma, covvalue = calibrate_mass(0.001, mass_left, mass_right, df1[df1['top_peps']]['mass diff'])
 942 |             except:
 943 |                 mass_shift, mass_sigma, covvalue = calibrate_mass(0.01, mass_left, mass_right, df1[df1['top_peps']]['mass diff'])
 944 | 
 945 |             logger.info('Uncalibrated mass shift: %.3f ppm', mass_shift)
 946 |             logger.info('Uncalibrated mass sigma: %.3f ppm', mass_sigma)
 947 | 
 948 |         logger.info('Estimated mass shift: %.3f ppm', mass_shift_cor)
 949 |         logger.info('Estimated mass sigma: %.3f ppm', mass_sigma_cor)
 950 | 
 951 |         out_log.write('Estimated mass shift: %.3f ppm\n' % (mass_shift_cor, ))
 952 |         out_log.write('Estimated mass sigma: %.3f ppm\n' % (mass_sigma_cor, ))
 953 | 
 954 |         resdict['md'] = df1['mass diff corrected'].values
 955 | 
 956 |         mass_shift = mass_shift_cor
 957 |         mass_sigma = mass_sigma_cor
 958 | 
 959 |         e_all = abs(resdict['md'] - mass_shift) / (mass_sigma)
 960 |         r = 3.0
 961 |         e_ind = e_all <= r
 962 |         resdict = filter_results(resdict, e_ind)
 963 | 
 964 | 
 965 |         e_ind = np.array([Isotopes[iorig] for iorig in resdict['iorig']]) >= min_isotopes_calibration
 966 |         resdict2 = filter_results(resdict, e_ind)
 967 | 
 968 | 
 969 |         e_ind = np.array([Scans[iorig] for iorig in resdict2['iorig']]) >= min_scans_calibration
 970 |         resdict2 = filter_results(resdict2, e_ind)
 971 | 
 972 |         e_ind = resdict2['mods'] == 0
 973 |         resdict2 = filter_results(resdict2, e_ind)
 974 | 
 975 | 
 976 |         if args['mc'] > 0:
 977 |             e_ind = resdict2['mc'] == 0
 978 |             resdict2 = filter_results(resdict2, e_ind)
 979 | 
 980 |         p1 = set(resdict2['seqs'])
 981 | 
 982 | 
 983 |         # Calculate basic protein scores including homologues
 984 |         prots_spc, p = calc_protein_scores(p1, pept_prot, protsN, isdecoy_key, prefix, best_base_results=False, p=False)
 985 |         # Calculate basic protein scores excluding homologues
 986 |         prots_spc, p = calc_protein_scores(p1, pept_prot, protsN, isdecoy_key, prefix, best_base_results=prots_spc, p=p)
 987 | 
 988 | 
 989 | 
 990 | 
 991 | 
 992 |         filtered_prots = aux.filter(prots_spc.items(), fdr=0.05, key=escore, is_decoy=isdecoy, remove_decoy=True, formula=1,
 993 |                                     full_output=True)
 994 | 
 995 |         identified_proteins = 0
 996 | 
 997 |         for x in filtered_prots:
 998 |             identified_proteins += 1
 999 |         logger.info('Stage 1 search: identified proteins = %d', identified_proteins)
1000 |         if identified_proteins <= 25:
1001 |             logger.info('Low number of identified proteins, using first 25 top scored proteins for calibration...')
1002 |             filtered_prots = sorted(prots_spc.items(), key=lambda x: -x[1])[:25]
1003 | 
1004 | 
1005 | 
1006 |         logger.info('Running RT prediction...')
1007 | 
1008 | 
1009 |         e_ind = np.array([Isotopes[iorig] for iorig in resdict['iorig']]) >= 1
1010 |         resdict2 = filter_results(resdict, e_ind)
1011 | 
1012 |         e_ind = resdict2['mods'] == 0
1013 |         resdict2 = filter_results(resdict2, e_ind)
1014 | 
1015 |         if args['mc'] > 0:
1016 |             e_ind = resdict2['mc'] == 0
1017 |             resdict2 = filter_results(resdict2, e_ind)
1018 | 
1019 | 
1020 |         true_seqs = []
1021 |         true_rt = []
1022 |         true_isotopes = []
1023 |         true_prots = set(x[0] for x in filtered_prots)
1024 |         for pep, proteins in pept_prot.items():
1025 |             if any(protein in true_prots for protein in proteins):
1026 |                 true_seqs.append(pep)
1027 |         e_ind = np.in1d(resdict2['seqs'], true_seqs)
1028 | 
1029 | 
1030 |         true_seqs = resdict2['seqs'][e_ind]
1031 | 
1032 |         true_rt.extend(np.array([rts[iorig] for iorig in resdict2['iorig']])[e_ind])
1033 |         true_rt = np.array(true_rt)
1034 |         true_isotopes.extend(np.array([Isotopes[iorig] for iorig in resdict2['iorig']])[e_ind])
1035 |         true_isotopes = np.array(true_isotopes)
1036 | 
1037 |         e_all = abs(resdict2['md'][e_ind] - mass_shift) / (mass_sigma)
1038 |         zs_all_tmp = e_all ** 2
1039 | 
1040 |         zs_all_tmp += (true_isotopes.max() - true_isotopes) * 100
1041 | 
1042 |         e_ind = np.argsort(zs_all_tmp)
1043 |         true_seqs = true_seqs[e_ind]
1044 |         true_rt = true_rt[e_ind]
1045 | 
1046 |         true_seqs = true_seqs[:2500]
1047 |         true_rt = true_rt[:2500]
1048 | 
1049 |         per_ind = np.random.RandomState(seed=SEED).permutation(len(true_seqs))
1050 |         true_seqs = true_seqs[per_ind]
1051 |         true_rt = true_rt[per_ind]
1052 | 
1053 |         best_seq = defaultdict(list)
1054 |         newseqs = []
1055 |         newRTs = []
1056 |         for seq, RT in zip(true_seqs, true_rt):
1057 |             best_seq[seq].append(RT)
1058 |         for k, v in best_seq.items():
1059 |             newseqs.append(k)
1060 |             newRTs.append(np.median(v))
1061 |         true_seqs = np.array(newseqs)
1062 |         true_rt = np.array(newRTs)
1063 | 
1064 |         if calib_path:
1065 |             df1 = pd.read_csv(calib_path, sep='\t')
1066 |             true_seqs = df1['peptide'].values
1067 |             true_rt = df1['RT exp'].values
1068 | 
1069 |             ll = len(true_seqs)
1070 |             true_seqs2 = true_seqs[int(ll/2):]
1071 |             true_rt2 = true_rt[int(ll/2):]
1072 |             true_seqs = true_seqs[:int(ll/2)]
1073 |             true_rt = true_rt[:int(ll/2)]
1074 | 
1075 |         else:
1076 | 
1077 |             ll = len(true_seqs)
1078 | 
1079 |             true_seqs2 = true_seqs[int(ll/2):]
1080 |             true_rt2 = true_rt[int(ll/2):]
1081 |             true_seqs = true_seqs[:int(ll/2)]
1082 |             true_rt = true_rt[:int(ll/2)]
1083 | 
1084 |         ns = true_seqs
1085 |         nr = true_rt
1086 |         ns2 = true_seqs2
1087 |         nr2 = true_rt2
1088 | 
1089 | 
1090 |         logger.info('First-stage peptides used for RT prediction: %d', len(true_seqs))
1091 | 
1092 |         # if args['ts'] != 2 and deeplc_path:
1093 | 
1094 |         #     dlc = DeepLC(verbose=False, batch_num=args['deeplc_batch_num'], path_model=path_model, write_library=write_library, use_library=path_to_lib, pygam_calibration=False)
1095 | 
1096 | 
1097 |         #     df_for_calib = pd.DataFrame({
1098 |         #         'seq': ns2,
1099 |         #         'modifications': [utils.mods_for_deepLC(seq, aa_to_psi) for seq in ns2],
1100 |         #         'tr': nr2,
1101 |         #     })
1102 | 
1103 |         #     df_for_check = pd.DataFrame({
1104 |         #         'seq': ns,
1105 |         #         'modifications': [utils.mods_for_deepLC(seq, aa_to_psi) for seq in ns],
1106 |         #         'tr': nr,
1107 |         #     })
1108 | 
1109 |         #     try:
1110 |         #         dlc.calibrate_preds(seq_df=df_for_calib, check_df=df_for_check)
1111 |         #     except:
1112 |         #         dlc.calibrate_preds(seq_df=df_for_calib)
1113 | 
1114 |         #     df_for_check['pr'] =  dlc.make_preds(seq_df=df_for_check)
1115 |         #     df_for_calib['pr'] =  dlc.make_preds(seq_df=df_for_calib)
1116 |         #     nr2_pred = df_for_calib['pr']
1117 | 
1118 |         #     rt_diff_tmp = df_for_check['pr'] - df_for_check['tr']
1119 | 
1120 |         #     XRT_shift, XRT_sigma, covvalue = calibrate_RT_gaus_full(rt_diff_tmp)
1121 | 
1122 |         # else:
1123 |         RC = achrom.get_RCs_vary_lcp(ns2, nr2, metric='mae')
1124 |         nr2_pred = np.array([achrom.calculate_RT(s, RC) for s in ns2])
1125 |         nr_pred = np.array([achrom.calculate_RT(s, RC) for s in ns])
1126 | 
1127 |         rt_diff_tmp = nr_pred - nr
1128 | 
1129 |         XRT_shift, XRT_sigma, covvalue = calibrate_RT_gaus_full(rt_diff_tmp)
1130 | 
1131 | 
1132 |         logger.info('First-stage calibrated RT shift: %.3f min', XRT_shift)
1133 |         logger.info('First-stage calibrated RT sigma: %.3f min', XRT_sigma)
1134 | 
1135 |         RT_sigma = XRT_sigma
1136 | 
1137 |     else:
1138 |         logger.info('No matches found')
1139 | 
1140 | 
1141 | 
1142 | 
1143 | 
1144 |     if args['ts']:
1145 | 
1146 | 
1147 |         if args['es']:
1148 |             logger.info('Use extra Stage 2 search...')
1149 | 
1150 |             qin = list(set(resdict['seqs']))
1151 |             qout = []
1152 |             pepdict = worker_RT(qin, qout, 0, 1, RC, False, False, True)
1153 | 
1154 |             rt_pred = np.array([pepdict[s] for s in resdict['seqs']])
1155 |             rt_diff = np.array([rts[iorig] for iorig in resdict['iorig']]) - rt_pred - XRT_shift
1156 |             e_all = (rt_diff) ** 2 / (RT_sigma ** 2)
1157 |             r = 9.0
1158 |             e_ind = e_all <= r
1159 |             resdict = filter_results(resdict, e_ind)
1160 | 
1161 | 
1162 | 
1163 | 
1164 |             e_ind = np.array([Isotopes[iorig] for iorig in resdict['iorig']]) >= min_isotopes_calibration
1165 |             resdict2 = filter_results(resdict, e_ind)
1166 | 
1167 | 
1168 |             e_ind = np.array([Scans[iorig] for iorig in resdict2['iorig']]) >= min_scans_calibration
1169 |             resdict2 = filter_results(resdict2, e_ind)
1170 | 
1171 |             e_ind = resdict2['mods'] == 0
1172 |             resdict2 = filter_results(resdict2, e_ind)
1173 | 
1174 | 
1175 |             if args['mc'] > 0:
1176 |                 e_ind = resdict2['mc'] == 0
1177 |                 resdict2 = filter_results(resdict2, e_ind)
1178 | 
1179 |             p1 = set(resdict2['seqs'])
1180 | 
1181 | 
1182 |             # Calculate basic protein scores including homologues
1183 |             prots_spc, p = calc_protein_scores(p1, pept_prot, protsN, isdecoy_key, prefix, best_base_results=False, p=False)
1184 |             # Calculate basic protein scores excluding homologues
1185 |             prots_spc, p = calc_protein_scores(p1, pept_prot, protsN, isdecoy_key, prefix, best_base_results=prots_spc, p=p)
1186 | 
1187 | 
1188 | 
1189 | 
1190 | 
1191 |             filtered_prots = aux.filter(prots_spc.items(), fdr=0.05, key=escore, is_decoy=isdecoy, remove_decoy=True, formula=1,
1192 |                                         full_output=True)
1193 | 
1194 |             identified_proteins = 0
1195 | 
1196 |             for x in filtered_prots:
1197 |                 identified_proteins += 1
1198 |             logger.info('Stage 2 search: identified proteins = %d', identified_proteins)
1199 |             if identified_proteins <= 25:
1200 |                 logger.info('Low number of identified proteins, using first 25 top scored proteins for calibration...')
1201 |                 filtered_prots = sorted(prots_spc.items(), key=lambda x: -x[1])[:25]
1202 | 
1203 | 
1204 | 
1205 | 
1206 | 
1207 | 
1208 | 
1209 |             e_ind = np.array([Isotopes[iorig] for iorig in resdict['iorig']]) >= 1
1210 |             resdict2 = filter_results(resdict, e_ind)
1211 | 
1212 |             e_ind = resdict2['mods'] == 0
1213 |             resdict2 = filter_results(resdict2, e_ind)
1214 | 
1215 |             if args['mc'] > 0:
1216 |                 e_ind = resdict2['mc'] == 0
1217 |                 resdict2 = filter_results(resdict2, e_ind)
1218 | 
1219 | 
1220 |             true_seqs = []
1221 |             true_rt = []
1222 |             true_isotopes = []
1223 |             true_prots = set(x[0] for x in filtered_prots)
1224 |             for pep, proteins in pept_prot.items():
1225 |                 if any(protein in true_prots for protein in proteins):
1226 |                     true_seqs.append(pep)
1227 |             e_ind = np.in1d(resdict2['seqs'], true_seqs)
1228 | 
1229 | 
1230 |             true_seqs = resdict2['seqs'][e_ind]
1231 | 
1232 |             true_rt.extend(np.array([rts[iorig] for iorig in resdict2['iorig']])[e_ind])
1233 |             true_rt = np.array(true_rt)
1234 |             true_isotopes.extend(np.array([Isotopes[iorig] for iorig in resdict2['iorig']])[e_ind])
1235 |             true_isotopes = np.array(true_isotopes)
1236 | 
1237 |             e_all = abs(resdict2['md'][e_ind] - mass_shift) / (mass_sigma)
1238 |             zs_all_tmp = e_all ** 2
1239 | 
1240 |             zs_all_tmp += (true_isotopes.max() - true_isotopes) * 100
1241 | 
1242 |             e_ind = np.argsort(zs_all_tmp)
1243 |             true_seqs = true_seqs[e_ind]
1244 |             true_rt = true_rt[e_ind]
1245 | 
1246 |             idx_limit = len(true_seqs)
1247 |             cnt_pep = len(set(true_seqs))
1248 |             while cnt_pep > 2500:
1249 |                 idx_limit -= 1
1250 |                 cnt_pep = len(set(true_seqs[:idx_limit]))
1251 |             true_seqs = true_seqs[:idx_limit]
1252 |             true_rt = true_rt[:idx_limit]
1253 | 
1254 |             per_ind = np.random.RandomState(seed=SEED).permutation(len(true_seqs))
1255 |             true_seqs = true_seqs[per_ind]
1256 |             true_rt = true_rt[per_ind]
1257 | 
1258 |             best_seq = defaultdict(list)
1259 |             newseqs = []
1260 |             newRTs = []
1261 |             for seq, RT in zip(true_seqs, true_rt):
1262 |                 best_seq[seq].append(RT)
1263 |             for k, v in best_seq.items():
1264 |                 newseqs.append(k)
1265 |                 newRTs.append(np.median(v))
1266 |             true_seqs = np.array(newseqs)
1267 |             true_rt = np.array(newRTs)
1268 | 
1269 |             if calib_path:
1270 |                 df1 = pd.read_csv(calib_path, sep='\t')
1271 |                 true_seqs = df1['peptide'].values
1272 |                 true_rt = df1['RT exp'].values
1273 | 
1274 |                 ll = len(true_seqs)
1275 |                 true_seqs2 = true_seqs[int(ll/2):]
1276 |                 true_rt2 = true_rt[int(ll/2):]
1277 |                 true_seqs = true_seqs[:int(ll/2)]
1278 |                 true_rt = true_rt[:int(ll/2)]
1279 | 
1280 |             else:
1281 | 
1282 |                 ll = len(true_seqs)
1283 | 
1284 |                 true_seqs2 = true_seqs[int(ll/2):]
1285 |                 true_rt2 = true_rt[int(ll/2):]
1286 |                 true_seqs = true_seqs[:int(ll/2)]
1287 |                 true_rt = true_rt[:int(ll/2)]
1288 | 
1289 |             ns = true_seqs
1290 |             nr = true_rt
1291 |             ns2 = true_seqs2
1292 |             nr2 = true_rt2
1293 | 
1294 | 
1295 | 
1296 | 
1297 |         else:
1298 |             ns = np.array(ns)
1299 |             nr = np.array(nr)
1300 |             idx = np.abs((rt_diff_tmp) - XRT_shift) <= 3 * XRT_sigma
1301 |             ns = ns[idx]
1302 |             nr = nr[idx]
1303 | 
1304 |             rt_diff_tmp2 = nr2_pred - nr2
1305 |             ns2 = np.array(ns2)
1306 |             nr2 = np.array(nr2)
1307 |             idx = np.abs((rt_diff_tmp2) - XRT_shift) <= 3 * XRT_sigma
1308 |             ns2 = ns2[idx]
1309 |             nr2 = nr2[idx]
1310 | 
1311 |         logger.info('Second-stage peptides used for RT prediction: %d', len(ns))
1312 | 
1313 |         if deeplc_path:
1314 | 
1315 |             dlc = DeepLC(verbose=False, batch_num=args['deeplc_batch_num'], path_model=path_model, write_library=write_library, use_library=path_to_lib, pygam_calibration=False)
1316 | 
1317 | 
1318 |             df_for_calib = pd.DataFrame({
1319 |                 'seq': ns2,
1320 |                 'modifications': [utils.mods_for_deepLC(seq, aa_to_psi) for seq in ns2],
1321 |                 'tr': nr2,
1322 |             })
1323 | 
1324 |             df_for_check = pd.DataFrame({
1325 |                 'seq': ns,
1326 |                 'modifications': [utils.mods_for_deepLC(seq, aa_to_psi) for seq in ns],
1327 |                 'tr': nr,
1328 |             })
1329 | 
1330 | 
1331 |             try:
1332 |                 dlc.calibrate_preds(seq_df=df_for_calib, check_df=df_for_check)
1333 |             except:
1334 |                 dlc.calibrate_preds(seq_df=df_for_calib)
1335 | 
1336 |             df_for_check['pr'] =  dlc.make_preds(seq_df=df_for_check)
1337 | 
1338 |             rt_diff_tmp = df_for_check['pr'] - df_for_check['tr']
1339 | 
1340 |             XRT_shift, XRT_sigma, covvalue = calibrate_RT_gaus_full(rt_diff_tmp)
1341 | 
1342 |         else:
1343 | 
1344 |             RC = achrom.get_RCs_vary_lcp(ns, nr, metric='mae')
1345 |             RT_pred = np.array([achrom.calculate_RT(s, RC) for s in ns])
1346 | 
1347 |             rt_diff_tmp = RT_pred - nr
1348 | 
1349 |             XRT_shift, XRT_sigma, covvalue = calibrate_RT_gaus_full(rt_diff_tmp)
1350 | 
1351 |         RT_sigma = XRT_sigma
1352 | 
1353 |     logger.info('Second-stage calibrated RT shift: %.3f min', XRT_shift)
1354 |     logger.info('Second-stage calibrated RT sigma: %.3f min', XRT_sigma)
1355 | 
1356 |     out_log.write('Calibrated RT shift: %.3f min\n' % (XRT_shift, ))
1357 |     out_log.write('Calibrated RT sigma: %.3f min\n' % (XRT_sigma, ))
1358 | 
1359 |     p1 = set(resdict['seqs'])
1360 | 
1361 |     n = args['nproc']
1362 | 
1363 | 
1364 |     def divide_chunks(l, n):
1365 |         for i in range(0, len(l), n):
1366 |             yield l[i:i + n]
1367 | 
1368 |     if deeplc_path:
1369 | 
1370 |         pepdict = dict()
1371 | 
1372 |         if args['save_calib']:
1373 |             with open(base_out_name + '_calib.tsv', 'w') as output:
1374 |                 output.write('peptide\tRT exp\n')
1375 |                 for seq, RT in zip(ns, nr):
1376 |                     output.write('%s\t%s\n' % (seq, str(RT)))
1377 |                 for seq, RT in zip(ns2, nr2):
1378 |                     output.write('%s\t%s\n' % (seq, str(RT)))
1379 | 
1380 |         seqs_batch = list(p1)
1381 | 
1382 |         df_for_check = pd.DataFrame({
1383 |             'seq': seqs_batch,
1384 |             'modifications': [utils.mods_for_deepLC(seq, aa_to_psi) for seq in seqs_batch],
1385 |         })
1386 | 
1387 |         df_for_check['pr'] =  dlc.make_preds(seq_df=df_for_check)
1388 | 
1389 |         pepdict_batch = df_for_check.set_index('seq')['pr'].to_dict()
1390 | 
1391 |         pepdict.update(pepdict_batch)
1392 | 
1393 | 
1394 |     else:
1395 | 
1396 |         qin = list(p1)
1397 |         qout = []
1398 |         pepdict = worker_RT(qin, qout, 0, 1, RC, False, False, True)
1399 | 
1400 |     rt_pred = np.array([pepdict[s] for s in resdict['seqs']])
1401 |     # rt_diff = np.array([rts[iorig] for iorig in resdict['iorig']]) - rt_pred
1402 |     rt_diff = np.array([rts[iorig] for iorig in resdict['iorig']]) - rt_pred - XRT_shift
1403 |     # rt_diff = resdict['rt'] - rt_pred
1404 |     e_all = (rt_diff) ** 2 / (RT_sigma ** 2)
1405 |     r = 9.0
1406 |     e_ind = e_all <= r
1407 | 
1408 |     resdict = filter_results(resdict, e_ind)
1409 |     rt_diff = rt_diff[e_ind]
1410 |     rt_pred = rt_pred[e_ind]
1411 | 
1412 | 
1413 |     logger.info('RT prediction was finished')
1414 | 
1415 | 
1416 |     with open(base_out_name + '_protsN.tsv', 'w') as output:
1417 |         output.write('dbname\ttheor peptides\n')
1418 |         for k, v in protsN.items():
1419 |             output.write('\t'.join((k, str(v))) + '\n')
1420 | 
1421 |     with open(base_out_name + '_PFMs.tsv', 'w') as output:
1422 |         output.write('sequence\tmass diff\tRT diff\tpeak_id\tIntensity\tIntensitySum\tnScans\tnIsotopes\tproteins\tm/z\tRT\taveragineCorr\tcharge\tion_mobility\n')
1423 |         for seq, md, rtd, iorig in zip(resdict['seqs'], resdict['md'], rt_diff, resdict['iorig']):
1424 |             peak_id = ids[iorig]
1425 |             I = Is[iorig]
1426 |             Isum = Isums[iorig]
1427 |             nScans = Scans[iorig]
1428 |             nIsotopes = Isotopes[iorig]
1429 |             mzr = mzraw[iorig]
1430 |             rtr = rts[iorig]
1431 |             av = avraw[iorig]
1432 |             ch = charges[iorig]
1433 |             im = imraw[iorig]
1434 |             output.write('\t'.join((seq, str(md), str(rtd), str(peak_id), str(I), str(Isum), str(nScans), str(nIsotopes), ';'.join(pept_prot[seq]), str(mzr), str(rtr), str(av), str(ch), str(im))) + '\n')
1435 | 
1436 |     mass_diff = (resdict['md'] - mass_shift) / (mass_sigma)
1437 | 
1438 |     rt_diff = (np.array([rts[iorig] for iorig in resdict['iorig']]) - rt_pred - XRT_shift) / RT_sigma
1439 | 
1440 |     ## Will it override the settings provided in args? :k
1441 |     prefix = 'DECOY_'
1442 |     isdecoy = lambda x: x[0].startswith(prefix)
1443 |     isdecoy_key = lambda x: x.startswith(prefix)
1444 |     escore = lambda x: -x[1]
1445 | 
1446 |     param_grid = {
1447 |         'boosting_type': ['gbdt', ],
1448 |         'num_leaves': list(range(10, 1000)),
1449 |         'learning_rate': list(np.logspace(np.log10(0.001), np.log10(0.3), base = 10, num = 1000)),
1450 |         'min_child_samples': list(range(1, 1000, 5)),
1451 |         'reg_alpha': list(np.linspace(0, 1)),
1452 |         'reg_lambda': list(np.linspace(0, 1)),
1453 |         'colsample_bytree': list(np.linspace(0.01, 1, 100)),
1454 |         'subsample': list(np.linspace(0.01, 1, 100)),
1455 |         'is_unbalance': [True, False],
1456 |         'metric': ['rmse', ],
1457 |         'verbose': [-1, ],
1458 |         'num_threads': [args['nproc'], ],
1459 |         # 'use_quantized_grad': [True, ]
1460 |     }
1461 | 
1462 |     def get_X_array(df, feature_columns):
1463 |         return df.loc[:, feature_columns].values
1464 | 
1465 |     def get_Y_array_pfms(df):
1466 |         return df.loc[:, 'decoy'].values
1467 | 
1468 |     def get_features_pfms(dataframe):
1469 |         feature_columns = dataframe.columns
1470 |         columns_to_remove = []
1471 |         banned_features = {
1472 |             'iorig',
1473 |             'ids',
1474 |             'seqs',
1475 |             'decoy',
1476 |             'preds',
1477 |             'av',
1478 |             'Is',
1479 |             # 'Scans',
1480 |             'proteins',
1481 |             'peptide',
1482 |             'md',
1483 |             'qpreds',
1484 |             'decoy1',
1485 |             'decoy2',
1486 |             'top_25_targets',
1487 |             'G',
1488 |         }
1489 | 
1490 |         for feature in feature_columns:
1491 |             if feature in banned_features:
1492 |                 columns_to_remove.append(feature)
1493 |         feature_columns = feature_columns.drop(columns_to_remove)
1494 |         return feature_columns
1495 | 
1496 |     def objective_pfms(df, hyperparameters, iteration, threshold=0):
1497 |         """Objective function for grid and random search. Returns
1498 |         the cross validation score from a set of hyperparameters."""
1499 | 
1500 |         all_res = []
1501 | 
1502 |         for group_val in range(3):
1503 | 
1504 |             mask = df['G'] == group_val
1505 |             test_df = df[mask]
1506 |             test_ids = set(test_df['ids'])
1507 |             train_df = df[(~mask) & (df['ids'].apply(lambda x: x not in test_ids))]
1508 | 
1509 | 
1510 | 
1511 |             feature_columns = get_features_pfms(df)
1512 |             model = get_cat_model_final_pfms(train_df[~train_df['decoy2']], hyperparameters, feature_columns)
1513 | 
1514 |             df.loc[mask, 'preds'] = model.predict(get_X_array(df.loc[mask, :], feature_columns))
1515 | 
1516 |             test_df = df[mask]
1517 | 
1518 |             fpr, tpr, thresholds = metrics.roc_curve(get_Y_array_pfms(test_df[~test_df['decoy2']]), test_df[~test_df['decoy2']]['preds'])
1519 |             shr_v = metrics.auc(fpr, tpr)
1520 | 
1521 |             all_res.append(shr_v)
1522 | 
1523 |             if shr_v < threshold:
1524 |                 all_res = [0, ]
1525 |                 break
1526 | 
1527 |         shr_v = np.mean(all_res)
1528 | 
1529 |         return np.array([shr_v, hyperparameters, iteration, all_res], dtype=object)
1530 | 
1531 |     def random_search_pfms(df, param_grid, out_file, max_evals):
1532 |         """Random search for hyperparameter optimization.
1533 |         Writes result of search to csv file every search iteration."""
1534 | 
1535 |         threshold = 0
1536 | 
1537 | 
1538 | 
1539 |         # Dataframe for results
1540 |         results = pd.DataFrame(columns = ['sharpe', 'params', 'iteration', 'all_res'],
1541 |                                     index = list(range(max_evals)))
1542 |         for i in range(max_evals):
1543 | 
1544 |             # Choose random hyperparameters
1545 |             random_params = {k: np.random.RandomState().choice(v, 1)[0] for k, v in param_grid.items()}
1546 | 
1547 |             # Evaluate randomly selected hyperparameters
1548 |             eval_results = objective_pfms(df, random_params, i, threshold)
1549 |             results.loc[i, :] = eval_results
1550 | 
1551 |             threshold = max(threshold, np.mean(eval_results[3]) - 3 * np.std(eval_results[3]))
1552 | 
1553 |             # open connection (append option) and write results
1554 |             of_connection = open(out_file, 'a')
1555 |             writer = csv.writer(of_connection)
1556 |             writer.writerow(eval_results)
1557 |             of_connection.close()
1558 | 
1559 |         # Sort with best score on top
1560 |         results.sort_values('sharpe', ascending = False, inplace = True)
1561 |         results.reset_index(inplace = True)
1562 | 
1563 |         return results
1564 | 
1565 | 
1566 | 
1567 |     def get_cat_model_pfms(df, hyperparameters, feature_columns, train, test):
1568 |         feature_columns = list(feature_columns)
1569 |         dtrain = lgb.Dataset(get_X_array(train, feature_columns), get_Y_array_pfms(train), feature_name=feature_columns, free_raw_data=False)
1570 |         dvalid = lgb.Dataset(get_X_array(test, feature_columns), get_Y_array_pfms(test), feature_name=feature_columns, free_raw_data=False)
1571 |         np.random.seed(SEED)
1572 |         evals_result = {}
1573 |         model = lgb.train(hyperparameters, dtrain, num_boost_round=500, valid_sets=(dvalid,), valid_names=('valid',), verbose_eval=False,
1574 |                     early_stopping_rounds=10, evals_result=evals_result)
1575 |         return model
1576 | 
1577 |     def get_cat_model_final_pfms(df, hyperparameters, feature_columns):
1578 |         feature_columns = list(feature_columns)
1579 |         train = df
1580 |         dtrain = lgb.Dataset(get_X_array(train, feature_columns), get_Y_array_pfms(train), feature_name=feature_columns, free_raw_data=False)
1581 |         np.random.seed(SEED)
1582 |         model = lgb.train(hyperparameters, dtrain, num_boost_round=100)
1583 | 
1584 |         return model
1585 | 
1586 | 
1587 |     logger.info('Prepare ML features')
1588 | 
1589 |     df1 = pd.DataFrame()
1590 |     for k in resdict.keys():
1591 |         df1[k] = resdict[k]
1592 | 
1593 |     df1['ids'] = ids[df1['iorig'].values]
1594 |     df1['Is'] = Is[df1['iorig'].values]
1595 |     df1['Scans'] = Scans[df1['iorig'].values]
1596 |     df1['Isotopes'] = Isotopes[df1['iorig'].values]
1597 |     df1['mzraw'] = mzraw[df1['iorig'].values]
1598 |     df1['rt'] = rts[df1['iorig'].values]
1599 |     df1['av'] = avraw[df1['iorig'].values]
1600 |     df1['ch'] = charges[df1['iorig'].values]
1601 |     df1['im'] = imraw[df1['iorig'].values]
1602 | 
1603 |     df1['mass_diff'] = mass_diff
1604 |     df1['rt_diff'] = rt_diff
1605 |     df1['decoy'] = df1['seqs'].apply(lambda x: all(z.startswith(prefix) for z in pept_prot[x]))
1606 | 
1607 |     df1['peptide'] = df1['seqs']
1608 |     # mass_dict = {}
1609 |     # pI_dict = {}
1610 |     # charge_dict = {}
1611 |     # for pep in set(df1['peptide']):
1612 |     #     try:
1613 |     #         mass_dict[pep] = mass.fast_mass2(pep)
1614 |     #         pI_dict[pep] = electrochem.pI(pep)
1615 |     #         charge_dict[pep] = electrochem.charge(pep, pH=7.0)
1616 |     #     except:
1617 |     #         mass_dict[pep] = 0
1618 |     #         pI_dict[pep] = 0
1619 |     #         charge_dict[pep] = 0
1620 | 
1621 |     df1['plen'] = df1['peptide'].apply(lambda z: len(z))
1622 | 
1623 | 
1624 |     # df1['mass'] = df1['peptide'].apply(lambda x: mass_dict[x])
1625 |     # df1['pI'] = df1['peptide'].apply(lambda x: pI_dict[x])
1626 |     # df1['charge_theor'] = df1['peptide'].apply(lambda x: charge_dict[x])
1627 | 
1628 |     for aa in mass.std_aa_mass:
1629 |         df1['c_%s' % (aa, )] = df1['peptide'].apply(lambda x: x.count(aa))
1630 |     df1['c_DP'] = df1['peptide'].apply(lambda x: x.count('DP'))
1631 |     df1['c_KP'] = df1['peptide'].apply(lambda x: x.count('KP'))
1632 |     df1['c_RP'] = df1['peptide'].apply(lambda x: x.count('RP'))
1633 | 
1634 |     df1['rt_diff_abs'] = df1['rt_diff'].abs()
1635 |     df1['rt_diff_abs_pdiff'] = df1['rt_diff_abs'] - df1.groupby('ids')['rt_diff_abs'].transform('median')
1636 |     df1['rt_diff_abs_pnorm'] = df1['rt_diff_abs'] / (df1.groupby('ids')['rt_diff_abs'].transform('sum') + 1e-2)
1637 | 
1638 |     df1['mass_diff_abs'] = df1['mass_diff'].abs()
1639 |     df1['mass_diff_abs_pdiff'] = df1['mass_diff_abs'] - df1.groupby('ids')['mass_diff_abs'].transform('median')
1640 |     df1['mass_diff_abs_pnorm'] = df1['mass_diff_abs'] / (df1.groupby('ids')['mass_diff_abs'].transform('sum') + 1e-2)
1641 | 
1642 |     df1['id_count'] = df1.groupby('ids')['mass_diff'].transform('count')
1643 | 
1644 |     #limited CSD information
1645 |     if args['csd'] == 1:
1646 |         logging.info('Using limited CSD')
1647 |         df1['seq_count'] = df1.groupby('peptide')['mass_diff'].transform('count')
1648 |         df1['charge_count'] = df1.groupby('peptide')['ch'].transform('nunique')
1649 |         df1['im_count'] = df1.groupby('peptide')['im'].transform('nunique')
1650 | 
1651 |     #complete CSD information
1652 |     elif args['csd'] == 2:
1653 |         logging.info('Using complete CSD')
1654 |         #imports and functions: should be global?
1655 |         import json
1656 |         from keras import models
1657 | 
1658 |         os.environ["CUDA_VISIBLE_DEVICES"] = ""
1659 | 
1660 |         #constants
1661 |         PROTON = 1.00727646677
1662 |         TINY = 1e-6
1663 | 
1664 |         psi_to_single_ptm = {'[Acetyl]-': 'B',
1665 |                              '[Carbamidomethyl]': '',
1666 |                              'M[Oxidation]': 'J',
1667 |                              'S[Phosphorylation]': 'X',
1668 |                              'T[Phosphorylation]': 'O',
1669 |                              'Y[Phosphorylation]': 'U'}
1670 |         #functions
1671 |         def reshapeOneHot(X):
1672 |             X = np.dstack(X)
1673 |             X = np.swapaxes(X, 1, 2)
1674 |             X = np.swapaxes(X, 0, 1)
1675 |             return X
1676 | 
1677 |         def get_single_ptm_code(psi_sequence):
1678 |             sequence = psi_sequence
1679 |             for ptm in psi_to_single_ptm:
1680 |                 sequence = sequence.replace(ptm, psi_to_single_ptm[ptm])
1681 |             return sequence
1682 | 
1683 |         def one_hot_encode_peptide(psi_sequence, MAX_LENGTH = 40):
1684 |             peptide = get_single_ptm_code(psi_sequence)
1685 |             if len(peptide) > MAX_LENGTH:
1686 |                 print('Peptide length is larger than maximal length of ', str(MAX_LENGTH))
1687 |                 return ['', None]
1688 |             else:
1689 |                 AA_vocabulary = 'KRPTNAQVSGILCMJHFYWEDBXOU'#B: acetyl; J: oxidized Met; O: PhosphoT; X: PhosphoS; U: PhosphoY
1690 | 
1691 |                 one_hot_peptide = np.zeros((len(peptide), len(AA_vocabulary)))
1692 | 
1693 |                 for j in range(0, len(peptide)):
1694 |                     try:
1695 |                         aa = peptide[j]
1696 |                         one_hot_peptide[j, AA_vocabulary.index(aa)] = 1
1697 |                     except:
1698 |                         print(f'WARN: "{aa}" is not in vocabulary; it will be skipped')
1699 | 
1700 |                 no_front_paddings = int((MAX_LENGTH - len(peptide))/2)
1701 |                 peptide_front_paddings = np.zeros((no_front_paddings, one_hot_peptide.shape[1]))
1702 | 
1703 |                 no_back_paddings = MAX_LENGTH - len(peptide) - no_front_paddings
1704 |                 peptide_back_paddings = np.zeros((no_back_paddings, one_hot_peptide.shape[1]))
1705 | 
1706 |                 full_one_hot_peptide = np.vstack((peptide_front_paddings, one_hot_peptide, peptide_back_paddings))
1707 | 
1708 |                 return peptide, full_one_hot_peptide
1709 | 
1710 |         def expand_charges(row, min_charge, max_charge, mode=1):
1711 |             charge = np.arange(min_charge, max_charge + 1).reshape(-1, 1)
1712 |             mz = row['massCalib']/charge + PROTON * mode
1713 |             index = [f'{int(row["id"])}_{z[0]}' for z in charge]
1714 |             result = pd.DataFrame(mz, columns=['mz'], index=index)
1715 |             result['z'] = charge
1716 |             result['rt_start'] = row['rtStart']
1717 |             result['rt_end'] = row['rtEnd']
1718 |             if 'FAIMS' in row.keys():
1719 |                 result['scan_filter'] = f'ms cv={row["FAIMS"]}'
1720 | 
1721 |             return result
1722 | 
1723 |         def AUC(x, y):
1724 |             return (0.5 * (y[1:] + y[:-1]) * (x[1:] - x[:-1])).sum()
1725 | 
1726 |         def aggregate_csd(g):
1727 |             result = {k: v for k,v in zip(g['z'], g['area'])}
1728 |             result['pepid'] = g['pepid'].iloc[0]
1729 |             return result
1730 | 
1731 |         #imports and functions: end
1732 | 
1733 |         logger.info('Running charge-state distribution prediction')
1734 | 
1735 |         ## extracting feature CSDs
1736 | 
1737 |         faims_mode = np.any(df_features['FAIMS'] != 0)
1738 |         logger.debug(f'FAIMS mode: {faims_mode}')
1739 | 
1740 |         #find all features that has been maped to sequence
1741 |         if faims_mode:
1742 |             used_features = df_features.loc[df1['ids'].unique(), ['massCalib', 'rtStart', 'rtEnd', 'FAIMS']].copy()
1743 |         else:
1744 |             used_features = df_features.loc[df1['ids'].unique(), ['massCalib', 'rtStart', 'rtEnd']].copy()
1745 |         used_features['id'] = used_features.index.astype(int)
1746 | 
1747 |         logger.debug(f'{used_features.shape[0]} features has been mapped to peptide sequence')
1748 | 
1749 |         #expand each feature to 1 - 4 charge states
1750 |         mz_table = pd.concat(used_features.apply(expand_charges, args=(1, 4), axis=1).tolist())
1751 |         #need better way to filter impossible m/z-s
1752 |         mzmin = np.floor(mzraw.min())
1753 |         mzmax = np.ceil(mzraw.max())
1754 |         mz_table = mz_table.query('mz >= @mzmin & mz <= @mzmax').copy()
1755 | 
1756 |         logger.debug(f'{mz_table.shape[0]} XICs scheduled for extraction')
1757 | 
1758 |         #writing JSON input for TRFP
1759 |         mz_table['tolerance'] = 5
1760 |         mz_table['tolerance_unit'] = 'ppm'
1761 |         mz_table['comment'] = mz_table.index
1762 |         trfp_xic_in = os.path.join(tempfile.gettempdir(), os.urandom(24).hex())
1763 |         with open(trfp_xic_in, 'w') as out:
1764 |             json.dump(mz_table.drop(['z'], axis=1).apply(lambda row: {k:v for k,v in zip(row.index, row)}, axis=1).tolist(), out)
1765 | 
1766 |         #running TRFP to extract XICs (needs RAW file)
1767 |         trfp_xic_out = os.path.join(tempfile.gettempdir(), os.urandom(24).hex())
1768 | 
1769 |         trfp_params = [args['trfp'], 'xic', '-i', f'{base_out_name[:-9]}.raw', '-b', trfp_xic_out, '-j', trfp_xic_in]
1770 | 
1771 |         if os.name != 'nt' and os.path.splitext(args['trfp'])[1].lower() == '.exe':
1772 |             trfp_params.insert(0, 'mono')
1773 | 
1774 |         logger.info('Running XIC extraction')
1775 | 
1776 |         try:
1777 |             subprocess.check_output(trfp_params)
1778 |             os.remove(trfp_xic_in)
1779 |         except subprocess.CalledProcessError as ex:
1780 |             logger.error(f'TRFP execution failed\nOutput:\n{ex.output}')
1781 |             raise Exception('TRFP XIC extraction failed')
1782 | 
1783 |         #read TRFP output
1784 |         with open(trfp_xic_out, 'r') as xic_input:
1785 |             chromatograms = json.load(xic_input)
1786 | 
1787 |         os.remove(trfp_xic_out)
1788 | 
1789 |         #calculate AUCs
1790 |         mz_table['area'] = [AUC(np.array(cr['RetentionTimes']), np.array(cr['Intensities'])) for cr in chromatograms['Content']]
1791 |         mz_table['pepid'] = mz_table['comment'].str.split('_', expand=True).iloc[:, 0].astype(int)
1792 | 
1793 |         #building CSD table
1794 |         csd_table = pd.DataFrame(mz_table.groupby('pepid').apply(aggregate_csd).tolist())
1795 |         csd_table.rename(columns={k:v for k,v in zip(csd_table.columns, csd_table.columns.astype(str))}, inplace=True)
1796 |         csd_table = csd_table.reindex(columns=sorted(csd_table.columns))
1797 | 
1798 |         #add CSD information
1799 |         used_features = used_features.join(csd_table.set_index('pepid', verify_integrity=True), how='left')
1800 | 
1801 |         #map CSD of features for the PFM information
1802 |         pfm_csd = used_features.reindex(df1['ids'])[['1', '2', '3', '4']].values
1803 | 
1804 |         ## predict CSDs for all sequences using the model
1805 | 
1806 |         #loading model
1807 |         CSD_model = models.load_model(
1808 |             os.path.normpath(os.path.join(os.path.dirname(__file__), 'models', 'CSD_model_LCMSMS.hdf5')),
1809 |             compile=False)
1810 |         CSD_model.compile()
1811 | 
1812 |         #using only unique sequences for prediction
1813 |         unique_sequences = pd.DataFrame(df1.drop_duplicates('seqs')['seqs'])
1814 |         CSD_X = reshapeOneHot(unique_sequences['seqs'].apply(lambda s: one_hot_encode_peptide(s)[1]).values)
1815 | 
1816 |         logger.debug(f'{unique_sequences.shape[0]} unique peptide sequence detected')
1817 |         logger.info('Running charge-state distribution prediction model')
1818 | 
1819 |         #prediction
1820 |         CSD_pred = CSD_model.predict(CSD_X, batch_size=2048)
1821 |         unique_sequences[['1', '2', '3', '4']] = CSD_pred[:, :4]
1822 |         pfm_csd_pred = unique_sequences.set_index('seqs').reindex(df1['seqs']).values
1823 | 
1824 |         #masking impossible m/z-s in prediction
1825 |         mask = np.isnan(pfm_csd)
1826 |         pfm_csd[mask] = 0
1827 |         pfm_csd_pred[mask] = 0
1828 | 
1829 |         #normalizing CSDs
1830 |         pfm_csd = pfm_csd / (pfm_csd.sum(axis=1) + TINY).reshape(-1, 1)
1831 |         pfm_csd_pred = pfm_csd_pred / (pfm_csd_pred.sum(axis=1) + TINY).reshape(-1, 1)
1832 | 
1833 |         #spectrum angle feature
1834 |         df1['z_angle'] = 1 - 2 * np.arccos((pfm_csd * pfm_csd_pred).sum(axis=1) /\
1835 |             (np.linalg.norm(pfm_csd, 2, axis=1) * np.linalg.norm(pfm_csd_pred, 2, axis=1) + TINY)) / np.pi
1836 | 
1837 |         #adding average_charge
1838 |         pfm_csd = np.concatenate([pfm_csd, (pfm_csd * np.arange(1, 5).reshape(1, 4)).sum(axis=1).reshape(-1, 1)], axis=1)
1839 |         pfm_csd_pred = np.concatenate([pfm_csd_pred, (pfm_csd_pred * np.arange(1, 5).reshape(1, 4)).sum(axis=1).reshape(-1, 1)], axis=1)
1840 | 
1841 |         #prediction error features for individual intensities and average charge
1842 |         z_deltas = pfm_csd - pfm_csd_pred
1843 |         df1[[f'z{z}_err' for z in '1234a']] = (z_deltas - z_deltas.mean(axis=0)) / z_deltas.std(axis=0).reshape(1, -1)
1844 | 
1845 |         logger.info('Charge-state distribution features added')
1846 | 
1847 |     p1 = set(resdict['seqs'])
1848 | 
1849 |     prots_spc2 = defaultdict(set)
1850 |     for pep, proteins in pept_prot.items():
1851 |         if pep in p1:
1852 |             for protein in proteins:
1853 |                 prots_spc2[protein].add(pep)
1854 | 
1855 |     for k in protsN:
1856 |         if k not in prots_spc2:
1857 |             prots_spc2[k] = set([])
1858 |     prots_spc = dict((k, len(v)) for k, v in prots_spc2.items())
1859 | 
1860 |     names_arr = np.array(list(prots_spc.keys()))
1861 |     v_arr = np.array(list(prots_spc.values()))
1862 |     n_arr = np.array([protsN[k] for k in prots_spc])
1863 | 
1864 |     top100decoy_score = [prots_spc.get(dprot, 0) for dprot in protsN if isdecoy_key(dprot)]
1865 |     top100decoy_N = [val for key, val in protsN.items() if isdecoy_key(key)]
1866 |     p = np.mean(top100decoy_score) / np.mean(top100decoy_N)
1867 |     logger.info('Stage 3 search: probability of random match for theoretical peptide = %.3f', p)
1868 | 
1869 |     prots_spc = dict()
1870 |     all_pvals = utils.calc_sf_all(v_arr, n_arr, p)
1871 |     for idx, k in enumerate(names_arr):
1872 |         prots_spc[k] = all_pvals[idx]
1873 | 
1874 |     target_prots_25_fdr = set([x[0] for x in aux.filter(prots_spc.items(), fdr=0.25, key=escore, is_decoy=isdecoy, remove_decoy=False, formula=1, full_output=True, correction=0)])
1875 |     df1['proteins'] = df1['seqs'].apply(lambda x: ';'.join(pept_prot[x]))
1876 |     df1['decoy2'] = df1['decoy']
1877 |     df1['decoy'] = df1['proteins'].apply(lambda x: all(z not in target_prots_25_fdr for z in x.split(';')))
1878 |     df1['top_25_targets'] = df1['decoy']
1879 | 
1880 | 
1881 |     if len(target_prots_25_fdr) <= 25:
1882 |         logger.info('Low number of identified proteins, turning off LightGBM...')
1883 |         filtered_prots = sorted(prots_spc.items(), key=lambda x: -x[1])[:25]
1884 |         skip_ml = 1
1885 |     else:
1886 |         skip_ml = 0
1887 | 
1888 |     if args['ml'] and not skip_ml:
1889 | 
1890 |         logger.info('Start Machine Learning on PFMs...')
1891 | 
1892 |         MAX_EVALS = 25
1893 | 
1894 |         out_file = os.path.join(tempfile.gettempdir(), os.urandom(24).hex())
1895 |         of_connection = open(out_file, 'w')
1896 |         writer = csv.writer(of_connection)
1897 | 
1898 |         headers = ['auc', 'params', 'iteration', 'all_res']
1899 |         writer.writerow(headers)
1900 |         of_connection.close()
1901 | 
1902 |         all_id_list = list(set(df1[df1['decoy']]['peptide']))
1903 |         np.random.RandomState(seed=SEED).shuffle(all_id_list)
1904 |         seq_gmap = {}
1905 |         for idx, split in enumerate(np.array_split(all_id_list, 3)):
1906 |             for id_ftr in split:
1907 |                 seq_gmap[id_ftr] = idx
1908 | 
1909 |         all_id_list = list(set(df1[~df1['decoy']]['peptide']))
1910 |         np.random.RandomState(seed=SEED).shuffle(all_id_list)
1911 |         for idx, split in enumerate(np.array_split(all_id_list, 3)):
1912 |             for id_ftr in split:
1913 |                 seq_gmap[id_ftr] = idx
1914 | 
1915 | 
1916 | 
1917 |         df1['G'] = df1['peptide'].apply(lambda x: seq_gmap[x])
1918 | 
1919 |         # train = df1[~df1['decoy']]
1920 |         # train_extra = df1[df1['decoy']]
1921 |         # train_extra = train_extra.sample(frac=min(1.0, len(train)/len(train_extra)))
1922 |         # train = pd.concat([train, train_extra], ignore_index=True).reset_index(drop=True)
1923 |         # random_results = random_search_pfms(train, param_grid, out_file, MAX_EVALS)
1924 | 
1925 |         random_results = random_search_pfms(df1, param_grid, out_file, MAX_EVALS)
1926 | 
1927 |         random_results = pd.read_csv(out_file)
1928 |         random_results = random_results[random_results['auc'] != 'auc']
1929 |         random_results['params'] = random_results['params'].apply(lambda x: ast.literal_eval(x))
1930 |         convert_dict = {'auc': float,
1931 |                     }
1932 |         random_results = random_results.astype(convert_dict)
1933 | 
1934 | 
1935 |         bestparams = random_results.sort_values(by='auc',ascending=False)['params'].values[0]
1936 | 
1937 |         bestparams['num_threads'] = args['nproc']
1938 | 
1939 | 
1940 | 
1941 |         for group_val in range(3):
1942 | 
1943 |             mask = df1['G'] == group_val
1944 |             test_df = df1[mask]
1945 |             test_ids = set(test_df['ids'])
1946 |             train_df = df1[(~mask) & (df1['ids'].apply(lambda x: x not in test_ids))]
1947 | 
1948 |             # mask2 = train['G'] != group_val
1949 | 
1950 |             # train_df = train[(mask2) & (train['ids'].apply(lambda x: x not in test_ids))]
1951 | 
1952 | 
1953 |             feature_columns = list(get_features_pfms(train_df))
1954 |             model = get_cat_model_final_pfms(train_df[~train_df['decoy2']], bestparams, feature_columns)
1955 | 
1956 |             df1.loc[mask, 'preds'] = rankdata(model.predict(get_X_array(test_df, feature_columns)), method='ordinal') / sum(mask)
1957 | 
1958 | 
1959 |     else:
1960 |         df1['preds'] = np.power(df1['mass_diff'], 2) + np.power(df1['rt_diff'], 2)
1961 | 
1962 | 
1963 | 
1964 | 
1965 |     df1['qpreds'] = pd.qcut(df1['preds'], 50, labels=range(50))
1966 | 
1967 |     df1['decoy'] = df1['decoy2']
1968 | 
1969 | 
1970 |     df1u = df1.sort_values(by='preds')
1971 |     df1u = df1u.drop_duplicates(subset='seqs')
1972 | 
1973 |     qval_ok = 0
1974 |     for qval_cur in range(50):
1975 |         df1ut = df1u[df1u['qpreds'] == qval_cur]
1976 |         decoy_ratio = df1ut['decoy'].sum() / len(df1ut)
1977 |         if decoy_ratio < ml_correction:
1978 |             qval_ok = qval_cur
1979 |         else:
1980 |             break
1981 |     logger.info('%d %% of PFMs were removed from protein scoring after Machine Learning', (100 - (qval_ok+1)*2))
1982 | 
1983 | #CHECK merge conflict was from here
1984 |     df1u = df1u[df1u['qpreds'] <= qval_ok]#.copy()
1985 | 
1986 |     df1u['qpreds'] = pd.qcut(df1u['preds'], 10, labels=range(10))
1987 | 
1988 |     qdict = df1u.set_index('seqs').to_dict()['qpreds']
1989 | 
1990 |     df1['qpreds'] = df1['seqs'].apply(lambda x: qdict.get(x, 11))
1991 | 
1992 |     # df1un = df1u[df1u['qpreds'] <= qval_ok].copy()
1993 |     # df1un['qpreds'] = pd.qcut(df1un['preds'], 10, labels=range(10))
1994 | 
1995 |     # qdict = df1un.set_index('seqs').to_dict()['qpreds']
1996 | 
1997 | 
1998 |     # df1['qpreds'] = df1['seqs'].apply(lambda x: qdict.get(x, 11))
1999 | 
2000 |     # # df1['qz'] = df1['seqs'].apply(lambda x: qval_dict.get(qdict.get(x, 11), 1.0))
2001 | 
2002 | #to here
2003 | 
2004 |     df1.to_csv(base_out_name + '_PFMs_ML.tsv', sep='\t', index=False)
2005 | 
2006 |     df1 = df1[df1['qpreds'] <= 10]
2007 | 
2008 |     resdict = {}
2009 |     resdict['seqs'] = df1['seqs'].values
2010 |     resdict['qpreds'] = df1['qpreds'].values
2011 |     resdict['ids'] = df1['ids'].values
2012 | 
2013 |     mass_diff = resdict['qpreds']
2014 |     rt_diff = []
2015 |     if skip_ml:
2016 |         mass_diff = np.zeros(len(mass_diff))
2017 | 
2018 |     p1 = set(resdict['seqs'])
2019 | 
2020 |     prots_spc2 = defaultdict(set)
2021 |     for pep, proteins in pept_prot.items():
2022 |         if pep in p1:
2023 |             for protein in proteins:
2024 |                 prots_spc2[protein].add(pep)
2025 | 
2026 |     for k in protsN:
2027 |         if k not in prots_spc2:
2028 |             prots_spc2[k] = set([])
2029 |     prots_spc = dict((k, len(v)) for k, v in prots_spc2.items())
2030 | 
2031 |     names_arr = np.array(list(prots_spc.keys()))
2032 |     v_arr = np.array(list(prots_spc.values()))
2033 |     n_arr = np.array([protsN[k] for k in prots_spc])
2034 | 
2035 |     top100decoy_score = [prots_spc.get(dprot, 0) for dprot in protsN if isdecoy_key(dprot)]
2036 |     top100decoy_N = [val for key, val in protsN.items() if isdecoy_key(key)]
2037 |     p = np.mean(top100decoy_score) / np.mean(top100decoy_N)
2038 |     logger.info('Final stage search: probability of random match for theoretical peptide = %.3f', p)
2039 | 
2040 | 
2041 |     prots_spc = dict()
2042 |     all_pvals = utils.calc_sf_all(v_arr, n_arr, p)
2043 |     for idx, k in enumerate(names_arr):
2044 |         prots_spc[k] = all_pvals[idx]
2045 | 
2046 |     sf = args['separate_figures']
2047 | 
2048 |     top_proteins = final_iteration(resdict, mass_diff, rt_diff, pept_prot, protsN, base_out_name, prefix, isdecoy, isdecoy_key, escore, fdr, args['nproc'], out_log, fname, separate_figures=sf)
2049 | 
2050 |     # pept_prot_limited = dict()
2051 |     # for k, v in pept_prot.items():
2052 |     #     tmp = set(zz for zz in v if zz in top_proteins)
2053 |     #     if len(tmp):
2054 |     #         pept_prot_limited[k] = tmp
2055 | 
2056 |     # resdict3 = get_resdict(pept_prot_limited, acc_l=25, acc_r=25, aa_mass=kwargs['aa_mass'])
2057 | 
2058 |     # p1 = set(resdict3['seqs'])
2059 | 
2060 |     # if deeplc_path:
2061 | 
2062 |     #     pepdict = dict()
2063 | 
2064 |     #     seqs_batch = list(p1)
2065 | 
2066 |     #     df_for_check = pd.DataFrame({
2067 |     #         'seq': seqs_batch,
2068 |     #         'modifications': [utils.mods_for_deepLC(seq, aa_to_psi) for seq in seqs_batch],
2069 |     #     })
2070 | 
2071 |     #     df_for_check['pr'] =  dlc.make_preds(seq_df=df_for_check)
2072 | 
2073 |     #     pepdict_batch = df_for_check.set_index('seq')['pr'].to_dict()
2074 | 
2075 |     #     pepdict.update(pepdict_batch)
2076 | 
2077 | 
2078 |     # else:
2079 | 
2080 |     #     qin = list(p1)
2081 |     #     qout = []
2082 |     #     pepdict = worker_RT(qin, qout, 0, 1, RC, False, False, True)
2083 | 
2084 |     # rt_pred = np.array([pepdict[s] for s in resdict3['seqs']])
2085 |     # rt_diff = np.array([rts[iorig] for iorig in resdict3['iorig']]) - rt_pred - XRT_shift
2086 |     # # e_all = (rt_diff) ** 2 / (RT_sigma ** 2)
2087 |     # # r = 9.0
2088 |     # # e_ind = e_all <= r
2089 |     # # resdict = filter_results(resdict, e_ind)
2090 |     # # rt_diff = rt_diff[e_ind]
2091 |     # # rt_pred = rt_pred[e_ind]
2092 | 
2093 | 
2094 | 
2095 |     # with open(base_out_name + '_PFMs_extended.tsv', 'w') as output:
2096 |     #     output.write('sequence\tmass diff\tRT diff\tpeak_id\tIntensity\tIntensitySum\tnScans\tnIsotopes\tproteins\tm/z\tRT\taveragineCorr\tcharge\tion_mobility\n')
2097 |     #     for seq, md, rtd, iorig in zip(resdict3['seqs'], resdict3['md'], rt_diff, resdict3['iorig']):
2098 |     #         peak_id = ids[iorig]
2099 |     #         I = Is[iorig]
2100 |     #         Isum = Isums[iorig]
2101 |     #         nScans = Scans[iorig]
2102 |     #         nIsotopes = Isotopes[iorig]
2103 |     #         mzr = mzraw[iorig]
2104 |     #         rtr = rts[iorig]
2105 |     #         av = avraw[iorig]
2106 |     #         ch = charges[iorig]
2107 |     #         im = imraw[iorig]
2108 |     #         output.write('\t'.join((seq, str(md), str(rtd), str(peak_id), str(I), str(Isum), str(nScans), str(nIsotopes), ';'.join(pept_prot[seq]), str(mzr), str(rtr), str(av), str(ch), str(im))) + '\n')
2109 | 
2110 | 
2111 |     # print('!!!', len(set(resdict3['seqs'])))
2112 | 
2113 |     logger.info('The search for file %s is finished.', base_out_name)
2114 | 
2115 | def worker(qin, qout, mass_diff, rt_diff, resdict, protsN, pept_prot, isdecoy_key, isdecoy, fdr, prots_spc_basic2, win_sys=False):
2116 | 
2117 |     for item in (iter(qin.get, None) if not win_sys else qin):
2118 |         mass_koef, rtt_koef = item
2119 |         e_ind = mass_diff <= mass_koef
2120 |         resdict2 = filter_results(resdict, e_ind)
2121 | 
2122 |         features_dict = dict()
2123 |         for pep in set(resdict2['seqs']):
2124 |             for bprot in pept_prot[pep]:
2125 |                 prot_score = prots_spc_basic2[bprot]
2126 |                 if prot_score > features_dict.get(pep, [-1, ])[-1]:
2127 |                     features_dict[pep] = (bprot, prot_score)
2128 | 
2129 |         prots_spc_basic = dict()
2130 | 
2131 |         p1 = set(resdict2['seqs'])
2132 | 
2133 |         pep_pid = defaultdict(set)
2134 |         pid_pep = defaultdict(set)
2135 |         banned_dict = dict()
2136 |         for pep, pid in zip(resdict2['seqs'], resdict2['ids']):
2137 |             pep_pid[pep].add(pid)
2138 |             pid_pep[pid].add(pep)
2139 |             if pep in banned_dict:
2140 |                 banned_dict[pep] += 1
2141 |             else:
2142 |                 banned_dict[pep] = 1
2143 | 
2144 |         banned_pids_total = set()
2145 | 
2146 |         if len(p1):
2147 |             prots_spc_final = dict()
2148 |             prots_spc_copy = False
2149 |             prots_spc2 = False
2150 |             unstable_prots = set()
2151 |             p0 = False
2152 | 
2153 |             prev_best_score = 1e6
2154 | 
2155 |             names_arr = False
2156 |             tmp_spc_new = False
2157 |             decoy_set = False
2158 | 
2159 |             cnt_target_final = 0
2160 |             cnt_decoy_final = 0
2161 | 
2162 |             while 1:
2163 |                 if not prots_spc2:
2164 | 
2165 |                     best_match_dict = dict()
2166 |                     n_map_dict = defaultdict(list)
2167 |                     for k, v in protsN.items():
2168 |                         n_map_dict[v].append(k)
2169 | 
2170 |                     decoy_set = set()
2171 |                     for k in protsN:
2172 |                         if isdecoy_key(k):
2173 |                             decoy_set.add(k)
2174 |                     # decoy_set = list(decoy_set)
2175 | 
2176 | 
2177 |                     prots_spc2 = defaultdict(set)
2178 |                     for pep, proteins in pept_prot.items():
2179 |                         if pep in p1:
2180 |                             for protein in proteins:
2181 |                                 if protein == features_dict[pep][0]:
2182 |                                     prots_spc2[protein].add(pep)
2183 | 
2184 |                     for k in protsN:
2185 |                         if k not in prots_spc2:
2186 |                             prots_spc2[k] = set([])
2187 |                     prots_spc2 = dict(prots_spc2)
2188 |                     unstable_prots = set(prots_spc2.keys())
2189 | 
2190 |                     top100decoy_N = sum([val for key, val in protsN.items() if isdecoy_key(key)])
2191 | 
2192 |                     names_arr = np.array(list(prots_spc2.keys()))
2193 |                     n_arr = np.array([protsN[k] for k in names_arr])
2194 | 
2195 |                     tmp_spc_new = dict((k, len(v)) for k, v in prots_spc2.items())
2196 | 
2197 | 
2198 |                     top100decoy_score_tmp = {dprot: tmp_spc_new.get(dprot, 0) for dprot in decoy_set}
2199 |                     top100decoy_score_tmp_sum = float(sum(top100decoy_score_tmp.values()))
2200 | 
2201 |                     prots_spc_basic = False
2202 | 
2203 |                 prots_spc = tmp_spc_new
2204 |                 if not prots_spc_copy:
2205 |                     prots_spc_copy = deepcopy(prots_spc)
2206 | 
2207 |                 # for idx, v in enumerate(decoy_set):
2208 |                 #     if v in unstable_prots:
2209 |                 #         top100decoy_score_tmp_sum -= top100decoy_score_tmp[idx]
2210 |                 #         top100decoy_score_tmp[idx] = prots_spc.get(v, 0)
2211 |                 #         top100decoy_score_tmp_sum += top100decoy_score_tmp[idx]
2212 | 
2213 | 
2214 |                 for v in decoy_set.intersection(unstable_prots):
2215 |                     top100decoy_score_tmp_sum -= top100decoy_score_tmp[v]
2216 |                     top100decoy_score_tmp[v] = prots_spc.get(v, 0)
2217 |                     top100decoy_score_tmp_sum += top100decoy_score_tmp[v]
2218 | 
2219 | 
2220 |                 # p = float(sum(top100decoy_score_tmp)) / top100decoy_N
2221 |                 p = top100decoy_score_tmp_sum / top100decoy_N
2222 |                 if not p0:
2223 |                     p0 = float(p)
2224 | 
2225 |                 n_change = set(protsN[k] for k in unstable_prots)
2226 |                 if len(best_match_dict) == 0:
2227 |                     n_change = sorted(n_change)
2228 |                 for n_val in n_change:
2229 |                     for k in n_map_dict[n_val]:
2230 |                         v = prots_spc[k]
2231 |                         if n_val not in best_match_dict or v > prots_spc[best_match_dict[n_val]]:
2232 |                             best_match_dict[n_val] = k
2233 |                 n_arr_small = []
2234 |                 names_arr_small = []
2235 |                 v_arr_small = []
2236 |                 max_k = 0
2237 | 
2238 |                 for k, v in best_match_dict.items():
2239 |                     num_matched = prots_spc[v]
2240 |                     if num_matched >= max_k:
2241 | 
2242 |                         max_k = num_matched
2243 | 
2244 |                         n_arr_small.append(k)
2245 |                         names_arr_small.append(v)
2246 |                         v_arr_small.append(prots_spc[v])
2247 | 
2248 |                 prots_spc_basic = dict()
2249 |                 all_pvals = utils.calc_sf_all(np.array(v_arr_small), n_arr_small, p)
2250 | 
2251 |                 for idx, k in enumerate(names_arr_small):
2252 |                     prots_spc_basic[k] = all_pvals[idx]
2253 | 
2254 |                 best_prot = utils.keywithmaxval(prots_spc_basic)
2255 | 
2256 |                 best_score = min(prots_spc_basic[best_prot], prev_best_score)
2257 |                 prev_best_score = best_score
2258 | 
2259 |                 unstable_prots = set()
2260 |                 if best_prot not in prots_spc_final:
2261 |                     prots_spc_final[best_prot] = best_score
2262 |                     banned_pids = set()
2263 |                     for pep in prots_spc2[best_prot]:
2264 |                         for pid in pep_pid[pep]:
2265 |                             if pid not in banned_pids_total:
2266 |                                 banned_pids.add(pid)
2267 |                     for pid in banned_pids:#banned_pids.difference(banned_pids_total):
2268 |                         for pep in pid_pep[pid]:
2269 |                             banned_dict[pep] -= 1
2270 |                             if banned_dict[pep] == 0:
2271 |                                 best_prot_val = features_dict[pep][0]
2272 |                                 tmp_spc_new[best_prot_val] -= 1
2273 |                                 unstable_prots.add(best_prot_val)
2274 |                                 # for bprot in pept_prot[pep]:
2275 |                                 #     if bprot == best_prot_val:
2276 |                                 #         tmp_spc_new[bprot] -= 1
2277 |                                 #         unstable_prots.add(bprot)
2278 | 
2279 |                     banned_pids_total.update(banned_pids)
2280 | 
2281 |                     if best_prot in decoy_set:
2282 |                         cnt_decoy_final += 1
2283 |                     else:
2284 |                         cnt_target_final += 1
2285 | 
2286 |                 else:
2287 | 
2288 |                     v_arr = np.array([prots_spc[k] for k in names_arr])
2289 |                     all_pvals = utils.calc_sf_all(v_arr, n_arr, p)
2290 |                     for idx, k in enumerate(names_arr):
2291 |                         prots_spc_basic[k] = all_pvals[idx]
2292 | 
2293 |                     for k, v in prots_spc_basic.items():
2294 |                         if k not in prots_spc_final:
2295 |                             prots_spc_final[k] = v
2296 | 
2297 |                     break
2298 | 
2299 |                 prots_spc_basic[best_prot] = 0
2300 | 
2301 |                 try:
2302 |                     prot_fdr = cnt_decoy_final / cnt_target_final
2303 |                     # prot_fdr = aux.fdr(prots_spc_final.items(), is_decoy=isdecoy)
2304 |                 except:
2305 |                     prot_fdr = 100.0
2306 |                 if prot_fdr >= 12.5 * fdr:
2307 | 
2308 |                     v_arr = np.array([prots_spc[k] for k in names_arr])
2309 |                     all_pvals = utils.calc_sf_all(v_arr, n_arr, p)
2310 |                     for idx, k in enumerate(names_arr):
2311 |                         prots_spc_basic[k] = all_pvals[idx]
2312 | 
2313 |                     for k, v in prots_spc_basic.items():
2314 |                         if k not in prots_spc_final:
2315 |                             prots_spc_final[k] = v
2316 |                     break
2317 | 
2318 |         if mass_koef == 9:
2319 |             item2 = prots_spc_copy
2320 |         else:
2321 |             item2 = False
2322 |         if not win_sys:
2323 |             qout.put((prots_spc_final, item2))
2324 |         else:
2325 |             qout.append((prots_spc_final, item2))
2326 |     if not win_sys:
2327 |         qout.put(None)
2328 |     else:
2329 |         return qout
2330 | 


--------------------------------------------------------------------------------
/ms1searchpy/models/CSD_model_LCMSMS.hdf5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/markmipt/ms1searchpy/ab7dd1aba1a513f71d028263972fd5e4c79e7090/ms1searchpy/models/CSD_model_LCMSMS.hdf5


--------------------------------------------------------------------------------
/ms1searchpy/ms1todiffacto.py:
--------------------------------------------------------------------------------
  1 | from __future__ import division
  2 | import argparse
  3 | import pandas as pd
  4 | import subprocess
  5 | import logging
  6 | 
  7 | def run():
  8 |     parser = argparse.ArgumentParser(
  9 |         description='run diffacto for ms1searchpy results',
 10 |         epilog='''
 11 | 
 12 |     Example usage
 13 |     -------------
 14 |     $ ms1todiffacto -S1 sample1_1_proteins.tsv sample1_n_proteins.tsv -S2 sample2_1_proteins.tsv sample2_n_proteins.tsv
 15 |     -------------
 16 |     ''',
 17 |         formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 18 | 
 19 |     parser.add_argument('-dif', help='path to Diffacto', required=True)
 20 |     parser.add_argument('-S1', nargs='+', help='input files for S1 sample', required=True)
 21 |     parser.add_argument('-S2', nargs='+', help='input files for S2 sample', required=True)
 22 |     parser.add_argument('-S3', nargs='+', help='input files for S3 sample')
 23 |     parser.add_argument('-S4', nargs='+', help='input files for S4 sample')
 24 |     parser.add_argument('-S5', nargs='+', help='input files for S5 sample')
 25 |     parser.add_argument('-S6', nargs='+', help='input files for S6 sample')
 26 |     parser.add_argument('-S7', nargs='+', help='input files for S7 sample')
 27 |     parser.add_argument('-S8', nargs='+', help='input files for S8 sample')
 28 |     parser.add_argument('-S9', nargs='+', help='input files for S9 sample')
 29 |     parser.add_argument('-S10', nargs='+', help='input files for S10 sample')
 30 |     parser.add_argument('-S11', nargs='+', help='input files for S11 sample')
 31 |     parser.add_argument('-S12', nargs='+', help='input files for S12 sample')
 32 |     parser.add_argument('-peptides', help='name of output peptides file', default='peptides.txt')
 33 |     parser.add_argument('-samples', help='name of output samples file', default='sample.txt')
 34 |     parser.add_argument('-allowed_prots', help='path to allowed prots', default='')
 35 |     parser.add_argument('-out', help='name of diffacto output file', default='diffacto_out.txt')
 36 |     parser.add_argument('-norm', help='normalization method. Can be average, median, GMM or None', default='None')
 37 |     parser.add_argument('-impute_threshold', help='impute_threshold for missing values fraction', default='0.75')
 38 |     parser.add_argument('-min_samples', help='minimum number of samples for peptide usage', default='3')
 39 |     parser.add_argument('-debug', help='Produce debugging output', action='store_true')
 40 |     args = vars(parser.parse_args())
 41 |     logging.basicConfig(format='%(levelname)9s: %(asctime)s %(message)s',
 42 |             datefmt='[%H:%M:%S]', level=[logging.INFO, logging.DEBUG][args['debug']])
 43 |     logger = logging.getLogger(__name__)
 44 | 
 45 |     replace_label = '_proteins.tsv'
 46 | 
 47 |     df_final = False
 48 | 
 49 |     allowed_prots = set()
 50 |     allowed_peptides = set()
 51 |     allowed_prots_all = set()
 52 | 
 53 |     all_labels = []
 54 | 
 55 |     if not args['allowed_prots']:
 56 | 
 57 |         for i in range(1, 13, 1):
 58 |             sample_num = 'S%d' % (i, )
 59 |             if args[sample_num]:
 60 |                 for z in args[sample_num]:
 61 |                     df0 = pd.read_table(z)
 62 |                     allowed_prots.update(df0['dbname'])
 63 |     else:
 64 |         for prot in open(args['allowed_prots'], 'r'):
 65 |             allowed_prots.add(prot.strip())
 66 | 
 67 | 
 68 |     for i in range(1, 13, 1):
 69 |         sample_num = 'S%d' % (i, )
 70 |         if args[sample_num]:
 71 |             for z in args[sample_num]:
 72 |                 df0 = pd.read_table(z.replace('_proteins.tsv', '_PFMs_ML.tsv'))
 73 |                 df0 = df0[df0['qpreds'] <= 10]
 74 |                 allowed_peptides.update(df0['seqs'])
 75 | 
 76 | 
 77 |     if not args['allowed_prots']:
 78 |         for i in range(1, 13, 1):
 79 |             sample_num = 'S%d' % (i, )
 80 |             if args[sample_num]:
 81 |                 for z in args[sample_num]:
 82 |                     df3 = pd.read_table(z.replace('_proteins.tsv', '_PFMs.tsv'))
 83 |                     df3 = df3[df3['sequence'].apply(lambda x: x in allowed_peptides)]
 84 | 
 85 |                     df3_tmp = df3[df3['proteins'].apply(lambda x: any(z in allowed_prots for z in x.split(';')))]
 86 |                     for dbnames in set(df3_tmp['proteins'].values):
 87 |                         for dbname in dbnames.split(';'):
 88 |                             allowed_prots_all.add(dbname)
 89 |     else:
 90 |         allowed_prots_all = allowed_prots
 91 | 
 92 | 
 93 |     for i in range(1, 13, 1):
 94 |         sample_num = 'S%d' % (i, )
 95 |         if args[sample_num]:
 96 |             for z in args[sample_num]:
 97 |                 label = z.replace(replace_label, '')
 98 |                 all_labels.append(label)
 99 |                 df3 = pd.read_table(z.replace(replace_label, '_PFMs.tsv'))
100 |                 logger.debug(z)
101 |                 logger.debug(z.replace(replace_label, '_PFMs.tsv'))
102 |                 logger.debug(df3.shape)
103 |                 logger.debug(df3.columns)
104 | 
105 |                 df3 = df3[df3['proteins'].apply(lambda x: any(z in allowed_prots_all for z in x.split(';')))]
106 |                 df3['proteins'] = df3['proteins'].apply(lambda x: ';'.join([z for z in x.split(';') if z in allowed_prots_all]))
107 | 
108 |                 df3['origseq'] = df3['sequence']
109 |                 df3['sequence'] = df3['sequence'] + df3['charge'].astype(int).astype(str) + df3['ion_mobility'].astype(str)
110 | 
111 |                 df3 = df3.sort_values(by='Intensity', ascending=False)
112 |                 df3 = df3.drop_duplicates(subset='sequence')
113 |                 # df3 = df3.explode('proteins')
114 | 
115 |                 df3[label] = df3['Intensity']
116 |                 df3['protein'] = df3['proteins']
117 |                 df3['peptide'] = df3['sequence']
118 |                 df3 = df3[['origseq', 'peptide', 'protein', label]]
119 | 
120 | 
121 | 
122 |                 if df_final is False:
123 |                     df_final = df3.reset_index(drop=True)
124 |                 else:
125 |                     df_final = df_final.reset_index(drop=True).merge(df3.reset_index(drop=True), on='peptide', how='outer')
126 |                     df_final.protein_x.fillna(value=df_final.protein_y, inplace=True)
127 |                     df_final.origseq_x.fillna(value=df_final.origseq_y, inplace=True)
128 |                     df_final['protein'] = df_final['protein_x']
129 |                     df_final['origseq'] = df_final['origseq_x']
130 | 
131 |                     df_final = df_final.drop(columns=['protein_x', 'protein_y'])
132 |                     df_final = df_final.drop(columns=['origseq_x', 'origseq_y'])
133 | 
134 | 
135 |     df_final['intensity_median'] = df_final[all_labels].median(axis=1)
136 |     df_final['nummissing'] = df_final[all_labels].isna().sum(axis=1)
137 |     logger.debug(df_final['nummissing'])
138 |     df_final = df_final.sort_values(by=['nummissing', 'intensity_median'], ascending=(True, False))
139 |     df_final = df_final.drop_duplicates(subset=('origseq', 'protein'))
140 | 
141 |     logger.debug(df_final.columns)
142 |     df_final = df_final.set_index('peptide')
143 |     df_final['proteins'] = df_final['protein']
144 |     df_final = df_final.drop(columns=['protein'])
145 |     cols = df_final.columns.tolist()
146 |     cols.remove('proteins')
147 |     cols.insert(0, 'proteins')
148 |     df_final = df_final[cols]
149 |     df_final.fillna(value='')
150 |     df_final.to_csv(args['peptides'], sep=',')
151 | 
152 |     out = open(args['samples'], 'w')
153 |     for i in range(1, 13, 1):
154 |         sample_num = 'S%d' % (i, )
155 |         if args[sample_num]:
156 |             for z in args[sample_num]:
157 |                 label = z.replace(replace_label, '')
158 |                 out.write(label + '\t' + sample_num + '\n')
159 |     out.close()
160 | 
161 |     subprocess.call([args['dif'], '-i', args['peptides'], '-samples', args['samples'], '-out',\
162 |      args['out'], '-normalize', args['norm'], '-impute_threshold', args['impute_threshold'], '-min_samples', args['min_samples']])
163 | 
164 | 
165 | 
166 | if __name__ == '__main__':
167 |     run()
168 | 


--------------------------------------------------------------------------------
/ms1searchpy/search.py:
--------------------------------------------------------------------------------
 1 | from . import main
 2 | import argparse
 3 | import logging
 4 | import os
 5 | 
 6 | def run():
 7 |     parser = argparse.ArgumentParser(
 8 |         description='Search proteins using LC-MS spectra',
 9 |         epilog='''
10 | 
11 |     Example usage
12 |     -------------
13 |     $ search.py input.mzML input2.mzML -d human.fasta -ad 1 -fdr 5.0
14 |     -------------
15 |     ''',
16 |         formatter_class=argparse.ArgumentDefaultsHelpFormatter)
17 | 
18 |     parser.add_argument('files', help='input mzML or .tsv files with peptide features', nargs='+')
19 |     parser.add_argument('-d', '-db', help='path to protein fasta file', required=True)
20 |     parser.add_argument('-o', help='path to output folder', default='')
21 |     parser.add_argument('-ptol', help='precursor mass tolerance in ppm', default=10.0, type=float)
22 |     parser.add_argument('-fdr', help='protein fdr filter in %%', default=1.0, type=float)
23 |     parser.add_argument('-i', help='minimum number of isotopes', default=2, type=int)
24 |     parser.add_argument('-ci', help='minimum number of isotopes for mass and RT calibration', default=4, type=int)
25 |     parser.add_argument('-csc', help='minimum number of scans for mass and RT calibration', default=4, type=int)
26 |     parser.add_argument('-ts', help='Two-stage RT training: 0 - turn off, 1 - turn one, 2 - turn on and use additive model in the first stage (Default)', default=2, type=int)
27 |     parser.add_argument('-sc', help='minimum number of scans for peptide feature', default=2, type=int)
28 |     parser.add_argument('-lmin', help='min length of peptides', default=7, type=int)
29 |     parser.add_argument('-lmax', help='max length of peptides', default=30, type=int)
30 |     parser.add_argument('-e', help='cleavage rule in quotes!. X!Tandem style for cleavage rules: "[RK]|{P}" for trypsin,\
31 |      "[X]|[D]" for asp-n or "[RK]|{P},[K]|[X]" for mix of trypsin and lys-c', default='[RK]|{P}')
32 |     parser.add_argument('-mc', help='number of missed cleavages', default=0, type=int)
33 |     parser.add_argument('-cmin', help='min precursor charge', default=1, type=int)
34 |     parser.add_argument('-cmax', help='max precursor charge', default=4, type=int)
35 |     parser.add_argument('-fmods', help='fixed modifications. Use "[" and "]" for N-term and C-term amino acids. in psiname1@aminoacid1,psiname2@aminoacid2 format', default='Carbamidomethyl@C')
36 |     parser.add_argument('-fmods_legend', help='PSI Names for extra fixed modifications. Oxidation, Carbamidomethyl and TMT6plex are stored by default in source code. in psiname1@monomass1,psiname2@monomass2 format', default='')
37 |     parser.add_argument('-ad', help='add decoy', default=0, type=int)
38 |     parser.add_argument('-ml', help='use machine learning for PFMs', default=1, type=int)
39 |     parser.add_argument('-prefix', help='decoy prefix', default='DECOY_')
40 |     parser.add_argument('-sf', '--separate-figures', action='store_true', help='save figures as separate files')
41 |     parser.add_argument('-nproc',   help='number of processes', default=4, type=int)
42 |     parser.add_argument('-force_nproc', help='Force using multiprocessing for Windows', action='store_true')
43 |     parser.add_argument('-deeplc', help='use deeplc: 0 - turn off, 1 - turn on', default=0, type=int)
44 |     parser.add_argument('-deeplc_batch_num', help='batch_num for deeplc', default=100000, type=int)
45 |     parser.add_argument('-deeplc_model_path', help='path to deeplc model or folder with deeplc models', default='')
46 |     parser.add_argument('-deeplc_library', help='path to deeplc library', default='')
47 |     parser.add_argument('-pl', help='path to list of peptides for RT calibration', default='')
48 |     parser.add_argument('-mcalib', help='mass calibration: 2 - group by ion mobility and RT, 1 - by RT, 0 - no calibration', default=0, type=int)
49 |     parser.add_argument('-debug', help='Produce debugging output', action='store_true')
50 |     parser.add_argument('-save_calib', help='Save RT calibration list', action='store_true')
51 |     parser.add_argument('-check_unique', help='Experimental. check_unique', default=1, type=int)
52 |     parser.add_argument('-es', help='Experimental. Use extra stage for RT calibration', default=0, type=int)
53 |     parser.add_argument('-csd', help='Employ limited (1) or comlete (2) chrge-state distribution model; for complete model path to ThermoRawFileParser 1.4.2+ has to be provided. Default - (0) - don\'t use charge-state distribution' , default=0, type=int)
54 |     parser.add_argument('-trfp', help='Path to ThermoRawFileParser executable', default='')
55 | 
56 |     args = vars(parser.parse_args())
57 |     logging.basicConfig(format='%(levelname)9s: %(asctime)s %(message)s',
58 |             datefmt='[%H:%M:%S]', level=[logging.INFO, logging.DEBUG][args['debug']])
59 |     logging.getLogger('matplotlib.font_manager').disabled = True
60 |     logging.getLogger('matplotlib.category').disabled = True
61 |     logging.getLogger('matplotlib').setLevel(logging.WARNING)
62 |     logger = logging.getLogger(__name__)
63 | 
64 | 
65 |     if os.name == 'nt' and not args['force_nproc']:
66 |         logger.warning('Turning off multiprocessing for Windows system. Use -force_nproc option to turn it on')
67 |         args['nproc'] = 1
68 | 
69 |     logger.debug('Starting with args: %s', args)
70 |     main.process_file(args)
71 | 
72 | if __name__ == '__main__':
73 |     run()
74 | 


--------------------------------------------------------------------------------
/ms1searchpy/utils.py:
--------------------------------------------------------------------------------
  1 | from pyteomics import fasta, parser, mass
  2 | import os
  3 | from scipy.stats import binom
  4 | import numpy as np
  5 | import pandas as pd
  6 | import random
  7 | import itertools
  8 | from biosaur2 import main as bio_main
  9 | import logging
 10 | from copy import deepcopy
 11 | 
 12 | logger = logging.getLogger(__name__)
 13 | 
 14 | # Temporary for pyteomics <= Version 4.5.5 bug
 15 | if 'H-' in mass.std_aa_mass:
 16 |     del mass.std_aa_mass['H-']
 17 | if '-OH' in mass.std_aa_mass:
 18 |     del mass.std_aa_mass['-OH']
 19 | 
 20 | mods_custom_dict = {
 21 |     'Oxidation': 15.994915,
 22 |     'Carbamidomethyl': 57.021464,
 23 |     'TMT6plex': 229.162932,
 24 | }
 25 | 
 26 | 
 27 | def get_aa_mass_with_fixed_mods(fmods, fmods_legend):
 28 | 
 29 |     if fmods_legend:
 30 |         for mod in fmods_legend.split(','):
 31 |             psiname, m = mod.split('@')
 32 |             mods_custom_dict[psiname] = float(m)
 33 | 
 34 |     aa_mass = deepcopy(mass.std_aa_mass)
 35 |     aa_to_psi = dict()
 36 | 
 37 |     mass_h2o = mass.calculate_mass('H2O')
 38 |     for k in list(aa_mass.keys()):
 39 |         aa_mass[k] = round(mass.calculate_mass(sequence=k) - mass_h2o, 7)
 40 | 
 41 |     if fmods:
 42 |         for mod in fmods.split(','):
 43 |             psiname, aa = mod.split('@')
 44 |             if psiname not in mods_custom_dict:
 45 |                 logger.error('PSI Name for modification %s is missing in the modification legend' % (psiname, ))
 46 |                 raise Exception('Exception: missing PSI Name for modification')
 47 |             if aa == '[':
 48 |                 aa_mass['Nterm'] = float(mods_custom_dict[psiname])#float(m)
 49 |                 aa_to_psi['Nterm'] = psiname
 50 |             elif aa == ']':
 51 |                 aa_mass['Cterm'] = float(mods_custom_dict[psiname])#float(m)
 52 |                 aa_to_psi['Cterm'] = psiname
 53 |             else:
 54 |                 aa_mass[aa] += float(mods_custom_dict[psiname])#float(m)
 55 |                 aa_to_psi[aa] = psiname
 56 | 
 57 |     logger.debug(aa_mass)
 58 | 
 59 |     return aa_mass, aa_to_psi
 60 | 
 61 | 
 62 | def mods_for_deepLC(seq, aa_to_psi):
 63 |     if 'Nterm' in aa_to_psi:
 64 |         mods_list = ['0|%s' % (aa_to_psi['Nterm'], ), ]
 65 |     else:
 66 |         mods_list = []
 67 |     mods_list.extend([str(idx+1)+'|%s' % (aa_to_psi[aa]) for idx, aa in enumerate(seq) if aa in aa_to_psi])
 68 |     if 'Cterm' in aa_to_psi:
 69 |         mods_list.append(['-1|%s' % (aa_to_psi['Cterm'], ), ])
 70 |     return '|'.join(mods_list)
 71 | 
 72 | def recalc_spc(banned_dict, unstable_prots, prots_spc2):
 73 |     tmp = dict()
 74 |     for k in unstable_prots:
 75 |         tmp[k] = sum(banned_dict.get(l, 1) > 0 for l in prots_spc2[k])
 76 |     return tmp
 77 | 
 78 | def iterate_spectra(fname, min_ch, max_ch, min_isotopes, min_scans, nproc, check_unique=True):
 79 |     if os.path.splitext(fname)[-1].lower() == '.mzml':
 80 |         args = {
 81 |             'file': fname,
 82 |             'mini': 1,
 83 |             'minmz': 350,
 84 |             'maxmz': 1500,
 85 |             'pasefmini': 100,
 86 |             'htol': 8,
 87 |             'itol': 8,
 88 |             'paseftol': 0.05,
 89 |             'nm': 0,
 90 |             'o': '',
 91 |             'hvf': 1.3,
 92 |             'ivf': 5,
 93 |             'minlh': 2,
 94 |             'pasefminlh': 1,
 95 |             'nprocs': nproc,
 96 |             'cmin': 1,
 97 |             'cmax': 6,
 98 |             'dia': False,
 99 |             'diahtol': 25,
100 |             'diaminlh': 1,
101 |             'mgf': '',
102 |             'tof': False,
103 |             'profile': False,
104 |             'write_hills': False,
105 |             'debug': False  # actual debug value is set through logging, not here
106 |         }
107 |         bio_main.process_file(args)
108 |         fname = os.path.splitext(fname)[0] + '.features.tsv'
109 | 
110 |     df_features = pd.read_csv(fname, sep='\t')
111 | 
112 |     required_columns = [
113 |         'nIsotopes',
114 |         'nScans',
115 |         'charge',
116 |         'massCalib',
117 |         'rtApex',
118 |         'mz',
119 |         ]
120 | 
121 |     if not all(req_col in df_features.columns for req_col in required_columns):
122 |         logger.error('input feature file have missing columns: %s', ';'.join([req_col for req_col in required_columns if req_col not in df_features.columns]))
123 |         raise Exception('Exception: wrong columns in feature file')
124 |     logger.info('Total number of peptide isotopic clusters: %d', len(df_features))
125 | 
126 |     if 'id' not in df_features.columns:
127 |         df_features['id'] = df_features.index
128 |     if 'FAIMS' not in df_features.columns:
129 |         df_features['FAIMS'] = 0
130 |     if 'im' not in df_features.columns:
131 |         df_features['im'] = 0
132 | 
133 |     # if 'mz_std_1' in df_features.columns:
134 |     #     df_features['mz_diff_ppm_1'] = df_features.apply(lambda x: 1e6 * (x['mz'] - (x['mz_std_1'] - 1.00335 / x['charge'])) / x['mz'], axis=1)
135 |     #     df_features['mz_diff_ppm_2'] = -100
136 |     #     df_features.loc[df_features['intensity_2'] > 0, 'mz_diff_ppm_2'] = df_features.loc[df_features['intensity_2'] > 0, :].apply(lambda x: 1e6 * (x['mz'] - (x['mz_std_2'] - 2 * 1.00335 / x['charge'])) / x['mz'], axis=1)
137 | 
138 |     #     df_features['I-0-1'] = df_features.apply(lambda x: x['intensityApex'] / x['intensity_1'], axis=1)
139 |     #     df_features['I-0-2'] = -1
140 |     #     df_features.loc[df_features['intensity_2'] > 0, 'I-0-2'] = df_features.loc[df_features['intensity_2'] > 0, :].apply(lambda x: x['intensityApex'] / x['intensity_2'], axis=1)
141 | 
142 |     if check_unique:
143 |         # Check unique ids
144 |         if len(df_features['id']) != len(set(df_features['id'])):
145 |             df_features['id'] = df_features.index + 1
146 | 
147 |     # Remove features with low number of isotopes
148 |     df_features = df_features[df_features['nIsotopes'] >= min_isotopes]
149 | 
150 |     # Remove features with low number of Scans
151 |     df_features = df_features[df_features['nScans'] >= min_scans]
152 | 
153 |     # Remove features using min and max charges
154 |     df_features = df_features[df_features['charge'].apply(lambda x: min_ch <= x <= max_ch)]
155 | 
156 |     return df_features
157 | 
158 | def peptide_gen(args):
159 | 
160 | 
161 |     
162 |     prefix = args['prefix']
163 |     enzyme = get_enzyme(args['e'])
164 |     mc = args['mc']
165 |     minlen = args['lmin']
166 |     maxlen = args['lmax']
167 |     for prot in prot_gen(args):
168 |         for pep in prot_peptides(prot[1], enzyme, mc, minlen, maxlen, is_decoy=prot[0].startswith(prefix)):
169 |             yield pep
170 | 
171 | def get_enzyme(enzyme):
172 |     return convert_tandem_cleave_rule_to_regexp(enzyme)
173 |     # if enzyme in parser.expasy_rules:
174 |     #     return parser.expasy_rules.get(enzyme)
175 |     # else:
176 |     #     try:
177 |     #         enzyme = convert_tandem_cleave_rule_to_regexp(enzyme)
178 |     #         return enzyme
179 |     #     except:
180 |     #         return enzyme
181 | 
182 | def prot_gen(args):
183 |     db = args['d']
184 | 
185 |     with fasta.read(db) as f:
186 |         for p in f:
187 |             yield p
188 | 
189 | def prepare_decoy_db(args):
190 |     add_decoy = args['ad']
191 |     if add_decoy:
192 | 
193 |         prefix = args['prefix']
194 |         db = args['d']
195 |         out1, out2 = os.path.splitext(db)
196 |         out_db = out1 + '_shuffled' + out2
197 |         logger.info('Creating decoy database: %s', out_db)
198 | 
199 |         extra_check = False
200 |         if '{' in args['e']:
201 |             extra_check = True
202 |         if extra_check:
203 |             banned_pairs = set()
204 |             banned_aa = set()
205 |             for enzyme_local in args['e'].split(','):
206 |                 if '{' in enzyme_local:
207 |                     lpart, rpart = enzyme_local.split('|')
208 |                     for aa_left, aa_right in itertools.product(lpart[1:-1], rpart[1:-1]):
209 |                         banned_aa.add(aa_left)
210 |                         banned_aa.add(aa_right)
211 |                         banned_pairs.add(aa_left+aa_right)
212 | 
213 |             logger.debug(banned_aa)
214 |             logger.debug(banned_pairs)
215 | 
216 |         enzyme = get_enzyme(args['e'])
217 |         cleave_rule_custom = enzyme + '|' + '([BXZUO])'
218 |         # cleave_rule_custom = '([RKBXZUO])'
219 |         logger.debug(cleave_rule_custom)
220 | 
221 |         shuf_map = dict()
222 | 
223 |         prots = []
224 | 
225 |         for p in fasta.read(db):
226 |             if not p[0].startswith(prefix):
227 |                 target_peptides = [x[1] for x in parser.icleave(p[1], cleave_rule_custom, 0)]
228 | 
229 |                 checked_peptides = set()
230 |                 sample_list = []
231 |                 for idx, pep in enumerate(target_peptides):
232 | 
233 |                     if len(pep) > 2:
234 |                         pep_tmp = pep[1:-1]
235 |                         if extra_check:
236 |                             for bp in banned_pairs:
237 |                                 if bp in pep_tmp:
238 |                                     pep_tmp = pep_tmp.replace(bp, '')
239 |                                     checked_peptides.add(idx)
240 | 
241 | 
242 |                         sample_list.extend(pep_tmp)
243 |                 random.shuffle(sample_list)
244 |                 idx_for_shuffle = 0
245 | 
246 |                 decoy_peptides = []
247 |                 for idx, pep in enumerate(target_peptides):
248 | 
249 |                     if len(pep) > 2:
250 | 
251 |                         if pep in shuf_map:
252 |                             tmp_seq = shuf_map[pep]
253 |                         else:
254 |                             if not extra_check or idx not in checked_peptides:
255 |                                 tmp_seq = pep[0]
256 |                                 for pep_aa in pep[1:-1]:
257 |                                     tmp_seq += sample_list[idx_for_shuffle]
258 |                                     idx_for_shuffle += 1
259 |                                 tmp_seq += pep[-1]
260 |                             else:
261 |                                 max_l = len(pep)
262 |                                 tmp_seq = ''
263 |                                 ii = 0
264 |                                 while ii < max_l - 1:
265 |                                 # for ii in range(max_l-1):
266 |                                     if pep[ii] in banned_aa and pep[ii+1] in banned_aa and pep[ii] + pep[ii+1] in banned_pairs:
267 |                                         tmp_seq += pep[ii] + pep[ii+1]
268 |                                         ii += 1
269 |                                     else:
270 |                                         if ii == 0:
271 |                                             tmp_seq += pep[ii]
272 |                                         else:
273 |                                             tmp_seq += sample_list[idx_for_shuffle]
274 |                                             idx_for_shuffle += 1
275 | 
276 |                                     ii += 1
277 |                                 tmp_seq += pep[max_l-1]
278 | 
279 |                             shuf_map[pep] = tmp_seq
280 |                     else:
281 |                         tmp_seq = pep
282 | 
283 |                     decoy_peptides.append(tmp_seq)
284 | 
285 |                 assert len(target_peptides) == len(decoy_peptides)
286 | 
287 |                 prots.append((p[0], ''.join(target_peptides)))
288 |                 prots.append(('DECOY_' + p[0], ''.join(decoy_peptides)))
289 | 
290 |         fasta.write(prots, open(out_db, 'w')).close()
291 |         args['d'] = out_db
292 |         args['ad'] = 0
293 |     return args
294 | 
295 | seen_target = set()
296 | seen_decoy = set()
297 | def prot_peptides(prot_seq, enzyme, mc, minlen, maxlen, is_decoy, dont_use_seen_peptides=False):
298 | 
299 | 
300 |     dont_use_fast_valid = parser.fast_valid(prot_seq)
301 |     peptides = parser.cleave(prot_seq, enzyme, mc)
302 |     for pep in peptides:
303 |         plen = len(pep)
304 |         if minlen <= plen <= maxlen:
305 |             forms = []
306 |             if dont_use_fast_valid or pep in seen_target or pep in seen_decoy or parser.fast_valid(pep):
307 |                 if plen <= maxlen:
308 |                     forms.append(pep)
309 |             for f in forms:
310 |                 if dont_use_seen_peptides:
311 |                     yield f
312 |                 else:
313 |                     if f not in seen_target and f not in seen_decoy:
314 |                         if is_decoy:
315 |                             seen_decoy.add(f)
316 |                         else:
317 |                             seen_target.add(f)
318 |                         yield f
319 | 
320 | def get_prot_pept_map(args):
321 |     seen_target.clear()
322 |     seen_decoy.clear()
323 | 
324 | 
325 |     prefix = args['prefix']
326 |     enzyme = get_enzyme(args['e'])
327 |     mc = args['mc']
328 |     minlen = args['lmin']
329 |     maxlen = args['lmax']
330 | 
331 |     pept_prot = dict()
332 |     protsN = dict()
333 | 
334 |     target_prot_count = 0
335 |     decoy_prot_count = 0
336 |     target_peps = set()
337 |     decoy_peps = set()
338 | 
339 | 
340 | 
341 |     for desc, prot in prot_gen(args):
342 |         dbinfo = desc.split(' ')[0]
343 |         for pep in prot_peptides(prot, enzyme, mc, minlen, maxlen, desc.startswith(prefix), dont_use_seen_peptides=True):
344 |             pept_prot.setdefault(pep, set()).add(dbinfo)
345 |             protsN.setdefault(dbinfo, set()).add(pep)
346 |     for k, v in protsN.items():
347 |         if k.startswith(prefix):
348 |             decoy_prot_count += 1
349 |             decoy_peps.update(v)
350 |         else:
351 |             target_prot_count += 1
352 |             target_peps.update(v)
353 | 
354 |         protsN[k] = len(v)
355 | 
356 |     logger.info('Database information:')
357 |     logger.info('Target/Decoy proteins: %d/%d', target_prot_count, decoy_prot_count)
358 |     target_peps_number = len(target_peps)
359 |     decoy_peps_number = len(decoy_peps)
360 |     intersection_number = len(target_peps.intersection(decoy_peps)) / (target_peps_number + decoy_peps_number)
361 |     logger.info('Target/Decoy peptides: %d/%d', target_peps_number, decoy_peps_number)
362 |     logger.info('Target-Decoy peptide intersection: %.1f %%',
363 |         100 * intersection_number)
364 |        
365 |     ml_correction = decoy_peps_number * (1 - intersection_number) / target_peps_number * 0.5
366 |     del decoy_peps
367 |     del target_peps
368 |     return protsN, pept_prot, ml_correction
369 | 
370 | 
371 | def convert_tandem_cleave_rule_to_regexp(cleavage_rule):
372 | 
373 |     def get_sense(c_term_rule, n_term_rule):
374 |         if '{' in c_term_rule:
375 |             return 'N'
376 |         elif '{' in n_term_rule:
377 |             return 'C'
378 |         else:
379 |             if len(c_term_rule) <= len(n_term_rule):
380 |                 return 'C'
381 |             else:
382 |                 return 'N'
383 | 
384 |     def get_cut(cut, no_cut):
385 |         aminoacids = set(parser.std_amino_acids)
386 |         cut = ''.join(aminoacids & set(cut))
387 |         if '{' in no_cut:
388 |             no_cut = ''.join(aminoacids & set(no_cut))
389 |             return cut, no_cut
390 |         else:
391 |             no_cut = ''.join(set(parser.std_amino_acids) - set(no_cut))
392 |             return cut, no_cut
393 | 
394 |     out_rules = []
395 |     for protease in cleavage_rule.split(','):
396 |         protease = protease.replace('X', ''.join(parser.std_amino_acids))
397 |         c_term_rule, n_term_rule = protease.split('|')
398 |         sense = get_sense(c_term_rule, n_term_rule)
399 |         if sense == 'C':
400 |             cut, no_cut = get_cut(c_term_rule, n_term_rule)
401 |         else:
402 |             cut, no_cut = get_cut(n_term_rule, c_term_rule)
403 | 
404 |         if no_cut:
405 |             if sense == 'C':
406 |                 out_rules.append('([%s](?=[^%s]))' % (cut, no_cut))
407 |             else:
408 |                 out_rules.append('([^%s](?=[%s]))' % (no_cut, cut))
409 |         else:
410 |             if sense == 'C':
411 |                 out_rules.append('([%s])' % (cut, ))
412 |             else:
413 |                 out_rules.append('(?=[%s])' % (cut, ))
414 |     return '|'.join(out_rules)
415 | 
416 | 
417 | def keywithmaxval(d):
418 |      """ a) create a list of the dict's keys and values;
419 |          b) return the key with the max value"""
420 |      v=list(d.values())
421 |      k=list(d.keys())
422 |      return k[v.index(max(v))]
423 | 
424 | def calc_sf_all(v, n, p, prev_best_score=False):
425 |     sf_values = -np.log10(binom.sf(v-1, n, p))
426 |     sf_values[np.isnan(sf_values)] = 0
427 |     sf_values[np.isinf(sf_values)] = (prev_best_score if prev_best_score is not False else max(sf_values[~np.isinf(sf_values)]) * 2)
428 |     return sf_values
429 | 
430 | 


--------------------------------------------------------------------------------
/ms1searchpy/utils_figures.py:
--------------------------------------------------------------------------------
  1 | from __future__ import division
  2 | from scipy.stats import scoreatpercentile
  3 | from pyteomics import mass, auxiliary as aux
  4 | from collections import Counter
  5 | import os.path
  6 | import numpy as np
  7 | import matplotlib
  8 | matplotlib.use('Agg')
  9 | import matplotlib.pyplot as plt
 10 | from matplotlib.patches import Patch
 11 | try:
 12 |     import seaborn
 13 |     seaborn.set(rc={'axes.facecolor':'#ffffff'})
 14 |     seaborn.set_style('whitegrid')
 15 | except ImportError:
 16 |     pass
 17 | import re
 18 | import sys
 19 | import logging
 20 | import pandas as pd
 21 | redcolor = '#FC6264'
 22 | bluecolor = '#70aed1'
 23 | greencolor = '#8AA413'
 24 | aa_color_1 = '#fca110'
 25 | aa_color_2 = '#a41389'
 26 | 
 27 | 
 28 | def _get_sf(fig):
 29 |     return isinstance(fig, str)
 30 | 
 31 | 
 32 | def get_basic_distributions(df):
 33 |     if 'mz' in df.columns:
 34 |         mz_array = df['mz'].values
 35 |         rt_exp_array = df['rtApex'].values
 36 |         intensity_array = np.log10(df['intensityApex'].values)
 37 |         if 'FAIMS' in df.columns:
 38 |             faims_array = df['FAIMS'].replace('None', 0).values
 39 |         else:
 40 |             faims_array = np.zeros(len(df))
 41 |         mass_diff_array = []
 42 |         RT_diff_array = []
 43 |     else:
 44 |         mz_array = df['m/z'].values
 45 |         rt_exp_array = df['RT'].values
 46 |         intensity_array = np.log10(df['Intensity'].values)
 47 |         faims_array = df['ion_mobility'].values
 48 |         mass_diff_array = df['mass diff'].values
 49 |         RT_diff_array = df['RT diff'].values
 50 |     charges_array = df['charge'].values
 51 |     nScans_array = df['nScans'].values
 52 |     nIsotopes_array = df['nIsotopes'].values
 53 |     return mz_array, rt_exp_array, charges_array, intensity_array, nScans_array, nIsotopes_array, faims_array, mass_diff_array, RT_diff_array
 54 | 
 55 | 
 56 | def get_descriptor_array(df, df_f, dname):
 57 |     array_t = df[~df['decoy']][dname].values
 58 |     array_d = df[df['decoy']][dname].values
 59 |     array_v = df_f[dname].values
 60 |     return array_t, array_d, array_v
 61 | 
 62 | 
 63 | def plot_hist_basic(array_all, fig, subplot_max_x, subplot_i,
 64 |         xlabel, ylabel='# of identifications', idtype='features', bin_size_one=False):
 65 |     separate_figures = _get_sf(fig)
 66 |     if separate_figures:
 67 |         plt.figure()
 68 |     else:
 69 |         fig.add_subplot(subplot_max_x, 3, subplot_i)
 70 |     cbins = get_bins((array_all, ), bin_size_one)
 71 |     if idtype == 'features':
 72 |         plt.hist(array_all, bins=cbins, color=redcolor, alpha=0.8, edgecolor='#EEEEEE')
 73 |     else:
 74 |         plt.hist(array_all, bins=cbins, color=greencolor, alpha=0.8, edgecolor='#EEEEEE')
 75 |     plt.ylabel(ylabel)
 76 |     plt.xlabel(xlabel)
 77 |     if separate_figures:
 78 |         plt.savefig(outpath(fig, xlabel, '.png'))
 79 |         plt.close()
 80 | 
 81 | def plot_basic_figures(df, fig, subplot_max_x, subplot_start, idtype):
 82 |     mz_array, rt_exp_array, lengths_array, intensity_array, nScans_array, nIsotopes_array, faims_array, mass_diff_array, RT_diff_array = get_basic_distributions(df)
 83 | 
 84 |     plot_hist_basic(mz_array, fig, subplot_max_x, subplot_i=subplot_start,
 85 |                      xlabel='%s, precursor m/z' % (idtype, ), idtype=idtype)
 86 |     subplot_start += 1
 87 |     plot_hist_basic(rt_exp_array, fig, subplot_max_x, subplot_i=subplot_start,
 88 |                      xlabel='%s, RT experimental' % (idtype, ), idtype=idtype)
 89 |     subplot_start += 1
 90 |     plot_hist_basic(lengths_array, fig, subplot_max_x, subplot_i=subplot_start,
 91 |                      xlabel='%s, charge' % (idtype, ), idtype=idtype, bin_size_one=True)
 92 |     subplot_start += 1
 93 |     plot_hist_basic(intensity_array, fig, subplot_max_x, subplot_i=subplot_start,
 94 |                      xlabel='%s, log10(Intensity)' % (idtype, ), idtype=idtype)
 95 |     subplot_start += 1
 96 |     plot_hist_basic(nScans_array, fig, subplot_max_x, subplot_i=subplot_start,
 97 |                      xlabel='%s, nScans' % (idtype, ), idtype=idtype, bin_size_one=True)
 98 |     subplot_start += 1
 99 |     plot_hist_basic(nIsotopes_array, fig, subplot_max_x, subplot_i=subplot_start,
100 |                      xlabel='%s, nIsotopes' % (idtype, ), idtype=idtype, bin_size_one=True)
101 |     subplot_start += 1
102 |     plot_hist_basic(faims_array, fig, subplot_max_x, subplot_i=subplot_start,
103 |                      xlabel='%s, ion mobility' % (idtype, ), idtype=idtype)
104 |     if idtype=='peptides':
105 |         subplot_start += 1
106 |         plot_hist_basic(mass_diff_array, fig, subplot_max_x, subplot_i=subplot_start,
107 |                         xlabel='%s, mass diff ppm' % (idtype, ), idtype=idtype)
108 |         subplot_start += 1
109 |         plot_hist_basic(RT_diff_array, fig, subplot_max_x, subplot_i=subplot_start,
110 |                         xlabel='%s, RT diff min' % (idtype, ), idtype=idtype)
111 | 
112 | 
113 | def plot_protein_figures(df, df_f, fig, subplot_max_x, subplot_start, prefix):
114 |     plot_hist_descriptor(get_descriptor_array(df, df_f, dname='sq'), fig, subplot_max_x, subplot_start, xlabel='proteins, sequence coverage', ylabel='# of identifications', only_true=True)
115 |     subplot_start += 1
116 |     plot_hist_descriptor(get_descriptor_array(df, df_f, dname='matched peptides'), fig, subplot_max_x, subplot_start, xlabel='proteins, matched peptides', ylabel='# of identifications', only_true=True, bin_size_one=True)
117 |     subplot_start += 1
118 |     plot_hist_descriptor(get_descriptor_array(df, df_f, dname='corrected sq'), fig, subplot_max_x, subplot_start, xlabel='proteins, corrected sequence coverage', ylabel='# of identifications', only_true=True)
119 |     subplot_start += 1
120 |     plot_hist_descriptor(get_descriptor_array(df, df_f, dname='corrected matched peptides'), fig, subplot_max_x, subplot_start, xlabel='proteins, corrected matched peptides', ylabel='# of identifications', only_true=True, bin_size_one=True)
121 |     subplot_start += 1
122 |     plot_hist_descriptor(get_descriptor_array(df, df_f, dname='score'), fig, subplot_max_x, subplot_start, xlabel='proteins, score', ylabel='# of identifications', only_true=False)
123 |     subplot_start += 1
124 |     plot_qvalues(df, fig, subplot_max_x, subplot_start, prefix)
125 |     subplot_start += 1
126 |     return subplot_start
127 | 
128 | 
129 | def plot_hist_descriptor(inarrays, fig, subplot_max_x, subplot_i, xlabel, ylabel='# of identifications', only_true=False, bin_size_one=False):
130 |     separate_figures = _get_sf(fig)
131 |     if separate_figures:
132 |         plt.figure()
133 |     else:
134 |         fig.add_subplot(subplot_max_x, 3, subplot_i)
135 |     array_t, array_d, array_v = inarrays
136 |     if xlabel == 'proteins, score':
137 |         logscale=True
138 |     else:
139 |         logscale=False
140 |     if only_true:
141 |         cbins, width = get_bins_for_descriptors([array_v, ], bin_size_one=bin_size_one)
142 |         H3, _ = np.histogram(array_v, bins=cbins)
143 |         plt.bar(cbins[:-1], H3, width, align='center',color=greencolor, alpha=1, edgecolor='#EEEEEE', log=logscale)
144 |     else:
145 |         cbins, width = get_bins_for_descriptors(inarrays)
146 |         H1, _ = np.histogram(array_d, bins=cbins)
147 |         H2, _ = np.histogram(array_t, bins=cbins)
148 |         H3, _ = np.histogram(array_v, bins=cbins)
149 |         plt.bar(cbins[:-1], H1, width, align='center',color=redcolor, alpha=0.4, edgecolor='#EEEEEE', log=logscale)
150 |         plt.bar(cbins[:-1], H2, width, align='center',color=bluecolor, alpha=0.4, edgecolor='#EEEEEE', log=logscale)
151 |         plt.bar(cbins[:-1], H3, width, align='center',color=greencolor, alpha=1, edgecolor='#EEEEEE', log=logscale)
152 |         cbins = np.append(cbins[0], cbins)
153 |         H1 = np.append(np.append(0, H1), 0)
154 |         H2 = np.append(np.append(0, H2), 0)
155 |         cbins -= width / 2
156 |         plt.step(cbins, H2, where='post', color=bluecolor, alpha=0.8)
157 |         plt.step(cbins, H1, where='post', color=redcolor, alpha=0.8)
158 | 
159 | 
160 | 
161 |     plt.ylabel(ylabel)
162 |     plt.xlabel(xlabel)
163 |     if width == 1.0:
164 |         plt.xticks(np.arange(int(cbins[0]), cbins[-1], 1))
165 |         plt.gcf().canvas.draw()
166 |     if separate_figures:
167 |         plt.savefig(outpath(fig, xlabel, '.png'))
168 |         plt.close()
169 | 
170 | def plot_qvalues(df, fig, subplot_max_x, subplot_i, prefix):
171 |     separate_figures = _get_sf(fig)
172 |     df1 = df.copy()
173 |     if separate_figures:
174 |         plt.figure()
175 |     else:
176 |         fig.add_subplot(subplot_max_x, 3, subplot_i)
177 |     df1['shortname'] = df1['dbname'].apply(lambda x: x.replace(prefix, ''))
178 |     df1 = df1.sort_values(by='score', ascending=False)
179 |     df1 = df1.drop_duplicates(subset='shortname')
180 |     df1 = df1[df1['score'] > 0]
181 |     qar = aux.qvalues(df1, key='score', is_decoy='decoy', reverse=True, remove_decoy=True, correction=1)['q']
182 |     if sum(qar<=0.01) == 0:
183 |         qar = aux.qvalues(df1, key='score', is_decoy='decoy', reverse=True, remove_decoy=True, correction=0)['q']
184 | 
185 |     plt.plot(qar*100, np.arange(1, len(qar)+1, 1))
186 |     plt.ylabel('# identified proteins')
187 |     plt.xlabel('FDR, %')
188 |     plt.vlines(1, 0, len(qar), linestyle='--', color='b')
189 |     plt.text(2, len(qar), '1%% FDR: %d proteins' % (sum(qar<=0.01)), )
190 |     plt.vlines(5, 0, len(qar)*2/3, linestyle='--', color='b')
191 |     plt.text(6, len(qar)*2/3, '5%% FDR: %d proteins' % (sum(qar<=0.05)), )
192 |     plt.vlines(15, 0, len(qar)/3, linestyle='--', color='b')
193 |     plt.text(15, len(qar)/3, '15%% FDR: %d proteins' % (sum(qar<=0.15)), )
194 |     if separate_figures:
195 |         plt.savefig(outpath(fig, 'FDR, %', '.png'))
196 |         plt.close()
197 | 
198 | 
199 | def plot_legend(fig, subplot_max_x, subplot_start):
200 |     ax = fig.add_subplot(subplot_max_x, 3, subplot_start)
201 |     legend_elements = [Patch(facecolor=greencolor, label='Positive IDs'),
202 |                     Patch(facecolor=bluecolor, label='Targets'),
203 |                     Patch(facecolor=redcolor, label='Decoys')]
204 |     ax.legend(handles=legend_elements, loc='center', prop={'size': 24})
205 |     ax.set_axis_off()
206 | 
207 | 
208 | def plot_aa_stats(df_f, df, fig, subplot_max_x, subplot_i):
209 |     separate_figures = _get_sf(fig)
210 |     if separate_figures:
211 |         plt.figure()
212 |     else:
213 |         fig.add_subplot(subplot_max_x, 3, subplot_i)
214 | 
215 |     # Generate list of 20 standart amino acids
216 |     std_aa_list = list(mass.std_aa_mass.keys())
217 |     std_aa_list.remove('O')
218 |     std_aa_list.remove('U')
219 | 
220 |     # Count identified amino acids
221 |     aa_exp = Counter()
222 |     for pep in set(df_f['sequence']):
223 |         for aa in pep:
224 |             aa_exp[aa] += 1
225 | 
226 |     # Count theoretical amino acids
227 |     aa_theor = Counter()
228 |     for pep in set(df[df['decoy']]['sequence']):
229 |         for aa in pep:
230 |             aa_theor[aa] += 1
231 | 
232 |     aa_exp_sum = sum(aa_exp.values())
233 |     aa_theor_sum = sum(aa_theor.values())
234 |     lbls, vals = [], []
235 |     for aa in sorted(std_aa_list):
236 |         if aa_theor.get(aa, 0):
237 |             lbls.append(aa)
238 |             vals.append((aa_exp.get(aa, 0)/aa_exp_sum) / (aa_theor[aa]/aa_theor_sum))
239 |     clrs = [aa_color_1 if abs(x-1) < 0.4 else aa_color_2 for x in vals]
240 |     plt.bar(range(len(vals)), vals, color=clrs)
241 |     plt.xticks(range(len(lbls)), lbls)
242 |     plt.hlines(1.0, range(len(vals))[0]-1, range(len(vals))[-1]+1)
243 |     plt.ylabel('amino acid ID rate')
244 |     if separate_figures:
245 |         plt.savefig(outpath(fig, 'amino acid ID rate', '.png'))
246 |         plt.close()
247 | 
248 | 
249 | def calc_max_x_value(df, df_proteins):
250 |     cnt = 24 # number of basic figures
251 |     peptide_columns = set(df.columns)
252 |     features_list = []
253 |     for feature in features_list:
254 |         if feature in peptide_columns:
255 |             cnt += 1
256 |     return cnt // 3 + (1 if (cnt % 3) else 0)
257 | 
258 | 
259 | def plot_descriptors_figures(df, df_f, fig, subplot_max_x, subplot_start):
260 |     plot_hist_descriptor(get_descriptor_array(df, df_f, dname='mass diff'), fig, subplot_max_x, subplot_start, xlabel='precursor mass difference, ppm')
261 |     subplot_start += 1
262 |     plot_hist_descriptor(get_descriptor_array(df, df_f, dname='RT diff'), fig, subplot_max_x, subplot_start, xlabel='RT difference, min')
263 |     subplot_start += 1
264 |     separate_figures = _get_sf(fig)
265 |     if not separate_figures:
266 |         plot_legend(fig, subplot_max_x, subplot_start)
267 |     subplot_start += 1
268 | 
269 | 
270 | def get_bins(inarrays, bin_size_one=False):
271 |     tmp = np.concatenate(inarrays)
272 |     minv = tmp.min()
273 |     maxv = tmp.max()
274 |     if bin_size_one:
275 |         return np.arange(minv, maxv+1, 1)
276 |     else:
277 |         return np.linspace(minv, maxv+1, num=100)
278 | 
279 | 
280 | def get_bins_for_descriptors(inarrays, bin_size_one=False):
281 |     tmp = np.concatenate(inarrays)
282 |     minv = tmp.min()
283 |     maxv = tmp.max()
284 |     if len(set(tmp)) <= 15:
285 |         return np.arange(minv, maxv + 2, 1.0), 1.0
286 |     binsize = False
287 |     for inar in inarrays:
288 |         binsize_tmp = get_fdbinsize(inar)
289 |         if not binsize or binsize > binsize_tmp:
290 |             binsize = binsize_tmp
291 |     # binsize = get_fdbinsize(tmp)
292 |     if binsize < float(maxv - minv) / 300:
293 |         binsize = float(maxv - minv) / 300
294 | 
295 |     if bin_size_one:
296 |         binsize = int(binsize)
297 |         if binsize == 0:
298 |             binsize = 1
299 | 
300 |     lbin_s = scoreatpercentile(tmp, 1.0)
301 |     lbin = minv
302 |     if lbin_s and abs((lbin - lbin_s) / lbin_s) > 1.0:
303 |         lbin = lbin_s * 1.05
304 |     rbin_s = scoreatpercentile(tmp, 99.0)
305 |     rbin = maxv
306 |     if rbin_s and abs((rbin - rbin_s) / rbin_s) > 1.0:
307 |         rbin = rbin_s * 1.05
308 |     rbin += 1.5 * binsize
309 |     return np.arange(lbin, rbin + binsize, binsize), binsize
310 | 
311 | 
312 | def get_fdbinsize(data_list):
313 |     """Calculates the Freedman-Diaconis bin size for
314 |     a data set for use in making a histogram
315 |     Arguments:
316 |     data_list:  1D Data set
317 |     Returns:
318 |     optimal_bin_size:  F-D bin size
319 |     """
320 |     if not isinstance(data_list, np.ndarray):
321 |         data_list = np.array(data_list)
322 |     isnan = np.isnan(data_list)
323 |     data_list = np.sort(data_list[~isnan])
324 |     upperquartile = scoreatpercentile(data_list, 75)
325 |     lowerquartile = scoreatpercentile(data_list, 25)
326 |     iqr = upperquartile - lowerquartile
327 |     optimal_bin_size = 2. * iqr / len(data_list) ** (1. / 3.)
328 |     MINBIN = 1e-8
329 |     if optimal_bin_size < MINBIN:
330 |         return MINBIN
331 |     return optimal_bin_size
332 | 
333 | 
334 | def normalize_fname(s):
335 |     return re.sub(r'[<>:\|/?*]', '', s)
336 | 
337 | 
338 | def outpath(outfolder, s, ext='.png'):
339 |     return os.path.join(outfolder, normalize_fname(s) + ext)
340 | 
341 | 
342 | def plot_outfigures(df, df_peptides, df_peptides_f, base_out_name, df_proteins, df_proteins_f, prefix='DECOY_', separate_figures=False):
343 |     if not separate_figures:
344 |         fig = plt.figure(figsize=(16, 12))
345 |         dpi = fig.get_dpi()
346 |         fig.set_size_inches(3000.0/dpi, 3000.0/dpi)
347 |     else:
348 |         outfolder = os.path.join(base_out_name + '_figures')
349 |         if not os.path.isdir(outfolder):
350 |             os.makedirs(outfolder)
351 |         fig = outfolder
352 |     subplot_max_x = calc_max_x_value(df, df_proteins)
353 |     descriptor_start_index = 20
354 |     plot_basic_figures(df, fig, subplot_max_x, 1, 'features')
355 |     plot_basic_figures(df_peptides_f, fig, subplot_max_x, 8, 'peptides')
356 |     df_proteins['sq'] = df_proteins['matched peptides'] / df_proteins['theoretical peptides'] * 100
357 |     df_proteins_f['sq'] = df_proteins_f['matched peptides'] / df_proteins_f['theoretical peptides'] * 100
358 | 
359 |     p_decoy = df_proteins[df_proteins['decoy']]['matched peptides'].sum() / df_proteins[df_proteins['decoy']]['theoretical peptides'].sum()
360 |     df_proteins['corrected matched peptides'] = df_proteins['matched peptides'] - p_decoy * df_proteins['theoretical peptides']
361 |     df_proteins_f['corrected matched peptides'] = df_proteins_f['matched peptides'] - p_decoy * df_proteins_f['theoretical peptides']
362 |     df_proteins['corrected sq'] = df_proteins['corrected matched peptides'] / df_proteins['theoretical peptides'] * 100
363 |     df_proteins_f['corrected sq'] = df_proteins_f['corrected matched peptides'] / df_proteins_f['theoretical peptides'] * 100
364 | 
365 |     subplot_current = plot_protein_figures(df_proteins, df_proteins_f, fig, subplot_max_x, 17, prefix)
366 |     plot_aa_stats(df_peptides_f, df_peptides, fig, subplot_max_x, subplot_current)
367 |     plt.grid(color='#EEEEEE')
368 |     plt.tight_layout()
369 |     if not separate_figures:
370 |         plt.savefig(base_out_name + '.png')
371 |         plt.close()
372 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | pyteomics>=4.5.1
 2 | lxml
 3 | scipy
 4 | numpy
 5 | scikit-learn
 6 | lightgbm
 7 | pandas
 8 | biosaur2
 9 | matplotlib
10 | seaborn
11 | ete3


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | '''
 4 | setup.py file for ms1searchpy
 5 | '''
 6 | 
 7 | from setuptools import setup, find_packages
 8 | 
 9 | version = open('VERSION').readline().strip()
10 | 
11 | setup(
12 |     name                 = 'ms1searchpy',
13 |     version              = version,
14 |     description          = '''A proteomics search engine for LC-MS1 spectra.''',
15 |     long_description     = (''.join(open('README.md').readlines())),
16 |     long_description_content_type = 'text/markdown',
17 |     author               = 'Mark Ivanov',
18 |     author_email         = 'pyteomics@googlegroups.com',
19 |     install_requires     = [line.strip() for line in open('requirements.txt')],
20 |     classifiers          = ['Intended Audience :: Science/Research',
21 |                             'Programming Language :: Python :: 3',
22 |                             'Topic :: Education',
23 |                             'Topic :: Scientific/Engineering :: Bio-Informatics',
24 |                             'Topic :: Scientific/Engineering :: Chemistry',
25 |                             'Topic :: Scientific/Engineering :: Physics'],
26 |     license              = 'License :: OSI Approved :: Apache Software License',
27 |     packages             = find_packages(),
28 |     entry_points         = {'console_scripts': ['ms1searchpy = ms1searchpy.search:run',
29 |                                                 'ms1combine = ms1searchpy.combine:run',
30 |                                                 'ms1groups = ms1searchpy.group_specific:run',
31 |                                                 'ms1combine_proteins = ms1searchpy.combine_proteins:run',
32 |                                                 'directms1quant = ms1searchpy.directms1quant:run',
33 |                                                 'directms1quantmulti = ms1searchpy.directms1quantmulti:run',
34 |                                                 'directms1quantDIA = ms1searchpy.directms1quantDIA:run',
35 |                                                 'directms1quantneg = ms1searchpy.directms1quantneg:run',
36 |                                                 'ms1quant = ms1searchpy.directms1quant:run',
37 |                                                 'ms1todiffacto = ms1searchpy.ms1todiffacto:run',
38 |                                                 ]}
39 |     )
40 | 


--------------------------------------------------------------------------------
/test.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/sh
 2 | 
 3 | TESTDIR="test_data"
 4 | if [ "$1" = "--no-deeplc" ]; then
 5 |     NODEEPLC=true
 6 | else
 7 |     NODEEPLC=false
 8 | fi
 9 | 
10 | if [ ! -d "$TESTDIR" ]; then
11 |     echo "You must have the test dataset to run this test."
12 |     exit 1
13 | fi
14 | 
15 | cd "$TESTDIR"
16 | rm -vf *.features* *.tsv *.txt
17 | 
18 | echo ""
19 | echo "Starting ms1searchpy ..."
20 | echo "------------------------"
21 | ms1command="time ms1searchpy -d sprot_ecoli_ups.fasta -ad 1 -nproc 6 -debug "
22 | if ! $NODEEPLC; then ms1command+="-deeplc deeplc -deeplc_library deeplc.lib "; fi
23 | ms1command+="*.mzML"
24 | echo "$ms1command"
25 | eval "$ms1command" && echo "DirectMS1 run successful."
26 | 
27 | echo ""
28 | echo "Starting ms1combine ..."
29 | echo "-----------------------"
30 | time ms1combine *_UPS_4_0?.features_PFMs_ML.tsv -out UPS_4 && echo "ms1combine run successful."
31 | time ms1combine *_UPS_2_0?.features_PFMs_ML.tsv -out UPS_2 && echo "ms1combine run successful."
32 | 
33 | echo ""
34 | echo "Starting ms1todiffacto ..."
35 | echo "--------------------------"
36 | time ms1todiffacto -dif diffacto -S1 *_UPS_4_0?.features_proteins.tsv -S2 *_UPS_2_0?.features_proteins.tsv \
37 |     -norm median -out diffacto_output.tsv -min_samples 3 -debug && echo "ms1todiffacto run successful."
38 | 
39 | echo ""
40 | echo "Starting directms1quant ..."
41 | echo "---------------------------"
42 | time directms1quant -S1 *_UPS_4_0?.features_proteins_full.tsv -S2 *_UPS_2_0?.features_proteins_full.tsv -min_samples 3 \
43 |     && echo "DirectMS1quant run successful."
44 | 


--------------------------------------------------------------------------------